There is a newer version of this book.
This book teaches your students how to use Python for data analysis. It starts by showing how to use Pandas for data analysis, Seaborn for data visualization, and JupyterLab as your IDE. It gives your students a thorough course in descriptive analysis and an introductory course in predictive analysis. And it ties all of the skills together by presenting 4 real-world case studies…political, environmental, social, and sports analytics.
The Canvas course file contains all the objectives, quizzes, assignments, and slides that you need to run an effective course. It only takes a few clicks to import it into the Canvas LMS. Then, you can customize it for your course.
“The text was perfect for my class. It provided a solid foundation for my students in using the Pandas and Seaborn libraries. I really appreciated the four case studies. They were a big help for my students as they illustrated all phases of data analysis and visualization.”
To present the essential Python and data analysis skills in a manageable progression and at the right pace, this book is divided into 4 sections.
Section 1 consists of 4 chapters that help you get your students started with data analysis as quickly and effectively as possible.
Here, they’ll learn how to use JupyterLab and Jupyter Notebooks to organize and develop their analyses. They’ll learn how to use a subset of the Pandas module for data analysis and visualization. And they’ll learn how to use the Seaborn module to create professional data visualizations that can be used for presentations.
By the end of chapter 4, they’ll be able to start doing analyses of their own.
The 5 chapters in section 2 present the critical skills needed for data analysis. That includes:
By the end of this section, your students will have a solid set of the descriptive analysis skills that are needed in a wide variety of fields.
Although a full treatment of predictive analysis is beyond the scope of any first course in data analysis, we believe that all students should at least understand the basic concepts. So that’s the goal of this two-chapter section.
First, chapter 10 shows how to find the correlations between variables, how to use Scikit-learn to work with simple linear regression models, and how to use Seaborn to create and plot various types of linear regression models.
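As a rough illustration of those skills, the following sketch computes a correlation with Pandas and fits a simple linear regression with Scikit-learn. The data is invented, and the sketch skips the step of splitting the data into training and test sets, so treat it as an outline under those assumptions rather than the book's procedure.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical housing-style data, invented for this sketch
homes = pd.DataFrame({
    "sqft":  [1000, 1500, 2000, 2500, 3000],
    "price": [200000, 290000, 410000, 500000, 600000],
})

# The r-value (Pearson correlation) between the two columns
r = homes["sqft"].corr(homes["price"])

# A simple linear regression: one independent variable, one dependent variable
model = LinearRegression()
model.fit(homes[["sqft"]], homes["price"])

# The R-squared score on the same data, and a prediction for a new value
r2 = model.score(homes[["sqft"]], homes["price"])
predicted = model.predict(pd.DataFrame({"sqft": [1800]}))
```

With just one independent variable, the R-squared score is the square of the r-value, which makes a handy check on the results.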
Then, chapter 11 shows how to create and use multiple regression models, how to create and rescale dummy variables, and how to use Scikit-learn to not only select the right variables, but also the right number of variables for multiple regressions. These, of course, are the critical concepts and skills for doing an effective job of predictive analysis.
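Here is one way those steps might fit together: create dummy variables with Pandas, rescale the numeric columns, let Scikit-learn pick the variables, and fit a multiple regression. The data and column names are invented, and `SelectKBest` with `f_regression` is just one of the selectors that Scikit-learn provides. It's an assumption of this sketch, not necessarily the one the book uses.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical car data, invented for this sketch
cars = pd.DataFrame({
    "enginesize": [1.6, 2.0, 2.4, 3.0, 3.5, 1.8, 2.2, 4.0],
    "curbweight": [2300, 2600, 2900, 3300, 3600, 2450, 2750, 3900],
    "fueltype":   ["gas", "gas", "diesel", "gas", "diesel", "gas", "diesel", "gas"],
    "price":      [14000, 17500, 22000, 27000, 33000, 15500, 20500, 38000],
})

# Create dummy variables for the categorical column
dummies = pd.get_dummies(cars, columns=["fueltype"], dtype=int)
X = dummies.drop(columns=["price"])
y = dummies["price"]

# Rescale the numeric columns so they're on comparable scales
num_cols = ["enginesize", "curbweight"]
X[num_cols] = StandardScaler().fit_transform(X[num_cols])

# Select the 2 variables with the strongest univariate relationship to price
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)
selected = X.columns[selector.get_support()]

# Fit a multiple regression on the selected variables and score it
model = LinearRegression().fit(X[selected], y)
r2 = model.score(X[selected], y)
```

To look for the right number of variables, you could repeat the last two steps for different values of `k` and compare the scores, preferably on test data that wasn't used to fit the model.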
This section presents 4 case studies that show how the skills in this book can be applied to real-world datasets: the Polling, Forest Fires, Social Survey, and Sports Analytics case studies.
Frankly, you can’t master data analysis by working with toy datasets, and these case studies help ensure that your students will master data analysis at a professional level.
The book assumes that the students have some programming experience, the kind they would get from any introduction to programming course. Then, chapter 1 presents the minimal set of Python skills that are required for this book: how to import modules; how to call and chain methods; how to code lists, slices, tuples, and dictionaries; and how to continue statements over two lines. For the times when your students need to know more than that, they can use Murach’s Python Programming as a reference.
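For a sense of how small that set of skills is, this sketch touches each one using only the Python standard library. The values and variable names are made up for illustration.

```python
import statistics as stats   # import a module and give it a short alias

# A list, and two slices of it
scores = [88, 92, 79, 95, 85]
first_three = scores[:3]     # the first three items
last_two = scores[-2:]       # the last two items

# A tuple, unpacked into two variables
point = (3, 4)
x, y = point

# A dictionary, with a new key/value pair added to it
capitals = {"CA": "Sacramento", "OR": "Salem"}
capitals["WA"] = "Olympia"

# Call a function in the imported module
mean_score = stats.mean(scores)

# Chain three string methods: each one is called on the result of the one before
title = "  python for data analysis  ".strip().title().replace("For", "for")

# Continue a statement over two lines by coding it within parentheses
total = (scores[0] + scores[1] + scores[2]
         + scores[3] + scores[4])
```

Method chaining and implicit line continuation show up again and again with Pandas, where a single statement often calls several methods in a row.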
The only software that’s needed for this book is the Anaconda distribution of Python. It includes JupyterLab, Pandas, Seaborn, Scikit-learn, and more.
Appendixes A and B show how to download and install this distribution on both Windows and macOS systems. Then, chapter 1 shows how to get started with JupyterLab.
As we see it, this is the best primary text for any course in which the focus is on the use of Python for data analysis. But it is also the ideal supplementary text for a general course on data analysis because it shows how to use Python to apply the concepts and statistical methods to real-world datasets.
Like all of our books, this one has features that you won’t find in any competing book. Here are just three of them:
“This is my first exposure to Murach’s books, and I love them. I like the organization of the content, the consistent approach in each book, and the accuracy of the material.”
—Bob L., Michigan
“I really like the paired-pages format of detailed information on the left and quick notes on the right. This helps me to quickly find the information I’m looking for.”
—Roxanne T., Student, Washington
“I can’t praise this book highly enough. The clarity used in picking what to include, when to introduce it, and how to do so is remarkable.”
—Charles Ferguson, Software Developer, Australia
“Another thing I like is the exercises at the end of each chapter. They’re a great way to reinforce the main points of each chapter and force you to get your hands dirty.”
—Hien Luu, SD Forum/Java SIG
“Your book was indispensable to me. The answers were right there at every turn. All the examples made sense, and they all worked!”
—Alan Vogt, ETL Consultant, Massachusetts
“This book covers the perfect amount of description, and it does not make you bored by providing unnecessary details.”
—Posted at an online bookseller
On Murach’s Python Programming: “This is now my third book for Python, and it is the ONLY one that has made me feel comfortable solving problems and reading code. The paired pages approach is fantastic, and it makes learning the syntax, rules, and conventions understandable for me.”
—Posted at an online bookseller
“Your books shine out from the rest—the quality of writing and presentation of information is topnotch, and the consistency of quality across books is impressive.”
—Nolan Tamashiro, Developer
What data analysis is
The five phases of data analysis and visualization
The IDEs for Python data analysis
How to install and import the Python modules for data analysis
How to call and chain methods
The coding basics for Python data analysis
How to start JupyterLab and work with a Notebook
How to edit and run the cells in a Notebook
How to use the Tab completion and tooltip features
How syntax and runtime errors work
How to use Markdown language
How to get reference information
How to split the screen between two Notebooks
How to use Magic Commands
The Polling case study
The Forest Fires case study
The Social Survey case study
The Sports Analytics case study
The DataFrame structure
Two ways to get data into a DataFrame
How to save and restore a DataFrame
How to display the data in a DataFrame
How to use the attributes of a DataFrame
How to use the info(), nunique(), and describe() methods
How to access columns
How to access rows
How to access a subset of rows and columns
Another way to access a subset of rows and columns
How to sort the data
How to use the statistical methods
How to use Python for column arithmetic
How to modify the string data in columns
How to use indexes
How to pivot the data
How to melt the data
How to group the data
How to aggregate the data
How to plot the data
The Python libraries for data visualization
Long vs. wide data for data visualization
How the Pandas plot() method works by default
The three basic parameters for the Pandas plot() method
How to create a line plot or an area plot
How to create a scatter plot
How to create a bar plot
How to create a histogram or a density plot
How to create a box plot or a pie plot
How to improve the appearance of a plot
How to work with subplots
How to use chaining to get the plots you want
The Seaborn methods for plotting
The general methods vs. the specific methods
How to use the basic Seaborn parameters
How to use the Seaborn parameters for working with subplots
How to set the title, x label, and y label
How to set the ticks, x limits, and y limits
How to set the background style
How to work with subplots
How to save a plot
How to create a line plot
How to create a scatter plot
How to create a bar plot
How to create a box plot
How to create a histogram
How to create a KDE or ECDF plot
How to enhance a distribution plot
How to use other Axes methods to enhance a plot
How to annotate a plot
How to set the color palette
How to enhance a plot that has subplots
How to customize the titles for subplots
How to set the size of a specific plot
Common data sources
How to find and select the data that you want
How to import data directly into a DataFrame
How to download a file to disk before importing it
How to work with a zip file on disk
How to run queries against a database
How to use a SQL query to import data into a DataFrame
How to get and explore the metadata of a Stata file
How to build DataFrames for the metadata and the data
How to download a JSON file to disk
How to open a JSON file in JupyterLab
How to drill down into the data
How to build a DataFrame for the data
A general plan for cleaning the data
What the info() method can tell you
What the unique values can tell you
What the value counts can tell you
How to drop rows based on conditions
How to drop duplicate rows
How to drop columns
How to rename columns
How to find missing values
How to drop rows with missing values
How to fill missing values
How to find dates and numbers that are imported as objects
How to convert date and time strings to the datetime data type
How to convert object columns to numeric data types
How to work with the category data type
How to replace invalid values and convert a column’s data type
How to fix data problems when you import the data
How to find outliers
How to fix outliers
How to work with datetime columns
How to work with string columns
How to work with numeric columns
How to add a summary column to a DataFrame
How to apply functions to rows or columns
How to apply user-defined functions
How lambda expressions work with DataFrames
How to apply lambda expressions
How to set and remove an index
How to unstack indexed data
How to join DataFrames with an inner join
How to join DataFrames with a left or outer join
How to merge DataFrames
How to concatenate DataFrames
What the warning is telling you
What to do when the warning is displayed
What to watch for when the warning isn’t displayed
How to melt columns to create long data
How to plot melted columns
How to group and apply a single aggregate method
How to work with a DataFrameGroupBy object
How to apply multiple aggregate methods
How to use the pivot() method
How to use the pivot_table() method
How to create bins of equal size
How to create bins with equal numbers of values
How to plot binned data
How to select the rows with the largest values
How to calculate the percent change
How to rank rows
How to find other methods for analysis
How to generate time periods
How to reindex with datetime indexes
How to reindex with a semi-month index
How a user-defined function can improve a datetime index
How reindexing with an improved index can improve plots
How to use the resample() method
How to use the label and closed parameters when you downsample
How downsampling can improve plots
The concept of rolling windows
How to create rolling windows
How to plot rolling window data
How to create running totals
How to plot running totals
Types of predictive models
Introduction to regression analysis
The Housing dataset
How to identify correlations with a scatter plot
How to identify correlations with a grid of scatter plots
How to identify correlations with r-values
How to identify correlations with a heatmap
A procedure for creating and using a regression model
The function and methods for linear regression models
How to create, validate, and use a linear regression model
How to plot the predicted data
How to plot the residuals
The lmplot() method and some of its parameters
How to plot a simple linear regression
How to plot a logistic regression
How to plot a polynomial regression
How to plot a lowess regression
How to use the residplot() method to plot the residuals
The Cars dataset
How to create a simple regression model
How to plot the residuals of a simple regression
How to create a multiple regression model
How to plot the residuals of a multiple regression
How to identify categorical variables
How to review categorical variables
How to create dummy variables
How to rescale the data and check the correlations
How to create a multiple regression that includes dummy variables
How to select the independent variables
How to test different combinations of variables
How to use Scikit-learn to select the variables
How to select the right number of variables
Import the modules that you will need
Get the data
Display the data
Examine the data
Drop columns and rows
Rename columns
Fix object types
Fix data
Take an early plot with Pandas
Save the DataFrame
Add columns for grouping and filtering
Create a new DataFrame in long form
Take an early plot of the long data with Seaborn
Add monthly bins to the DataFrame
Add an average percent column for each month
Save the wide and long DataFrames
Plot the national and swing state polls
Plot the voter types
Plot the last two months of polling
Plot the gap changes in selected states
Prepare the gap data for the last week of polling
Plot the gap data for the last week of polling
Prepare the weekly gap data for the swing states
Plot the weekly gap data for the swing states
Download and unzip the SQLite database
Connect and query the database
Import the data into a DataFrame
Examine the data
Improve the readability of the data
Drop unnecessary rows
Drop duplicate rows
Convert dates to datetime objects
Check for missing contain dates
Add fire_month and days_burning columns
Examine the contain_date and days_burning columns
Analyze the data for California
Two more plots for California fires
Rank the states by total acres burned
Prepare a DataFrame for total acres burned by year within state
Prepare a DataFrame for the top 4 states
Plot the acres burned total by year for the top 4 states
Review the 20 largest fires in California
Use GeoPandas to plot the California map
Use GeoPandas or Seaborn to plot the California fires on a map
Plot the fires in the continental United States
Download and unzip the zip file for the data
Build a DataFrame for the metadata
Use the codebook and read the data that you want
Prepare the data
Plot the data and reduce the number of categories
Plot the total counts of the responses
Convert the counts to percents and plot them
Search the codebook for small question sets
Read and review the work-life data
Plot the responses for the first question
Plot the responses for the second and third questions
Use the codebook to find related columns
Use the codebook to find follow-up questions
Select the columns for an expanded DataFrame
Bin the data for a column
Develop and test a first hypothesis
Develop and test a second hypothesis
Develop and test a third hypothesis
Get the data
Build the DataFrame
Locate and drop unneeded rows
Locate and drop unneeded columns
Convert the game_date column to datetime data
Add a column for the season
Add a column for the shot result
Add a column for points made for each shot
Add three summary columns
Plot the points per game by season
Plot the averages of shots, shots made, and points per game by season
Plot the shot locations for two games
Plot the shot locations for two seasons
Plot the shot density for one season
Plot the shot density for two seasons
How to install Anaconda
How to use the Anaconda Prompt
How to use the Anaconda Navigator
How to install the files for this book
How to make sure Anaconda is installed correctly
How to download the large data files for this book
How to install Anaconda
How to run conda commands
How to use the Anaconda Navigator
How to install the files for this book
How to make sure Anaconda is installed correctly
How to download the large data files for this book
In contrast to other college publishers, we don’t fill dozens of pages in our books with end-of-chapter activities that may never be used. Instead, we provide everything you need for an effective course in a download from our instructor’s website. Then, you decide which of these materials are right for your course.
Here's a summary of the instructor's materials for this book. For a detailed description in PDF, please read the Instructor's Summary.
But don’t worry! We provide additional projects and case studies that you can use for testing, and those solutions are available only to instructors (see below).
To view the "Frequently Asked Questions" for this book in a PDF, just click on this link: View the questions
Then, if you have any questions that aren't answered here, please email us. Thanks!
To view the corrections for this book in a PDF, just click on this link: View the corrections
Then, if you find any other errors, please email us so we can correct them in the next printing of the book. Thank you!
This is our site for college instructors. To buy Murach books, please visit our retail site.