Python for Data Analysis: Data Wrangling with pandas, NumPy & Jupyter: Data Analysis Examples

Question 21 of 50

In the MovieLens 1M Dataset, what is the result of using `data.pivot_table("rating", index="title", columns="gender", aggfunc="mean")`?

Correct answer: A DataFrame with movie titles as the index, gender ('F', 'M') as columns, and the mean rating as the values.

Explanation

The `pivot_table` method reshapes and summarizes data. In this call, the first positional argument, `"rating"`, is the `values` parameter and names the column to aggregate; `index="title"` makes each movie title a row; `columns="gender"` creates one column per unique gender value; and `aggfunc="mean"` averages the ratings for each title/gender combination. A title with no ratings from one gender gets `NaN` in that column.
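The same call can be tried on a small frame to see the resulting shape. This is a minimal sketch: the six rows below are made-up stand-ins for the merged MovieLens `data` frame, not real ratings, and the name `mean_ratings` is chosen here only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the merged MovieLens `data` frame.
data = pd.DataFrame({
    "title": ["Alien", "Alien", "Alien", "Big", "Big", "Clue"],
    "gender": ["F", "M", "M", "F", "F", "M"],
    "rating": [4, 5, 3, 2, 4, 5],
})

# "rating" is passed positionally as the `values` argument.
mean_ratings = data.pivot_table(
    "rating", index="title", columns="gender", aggfunc="mean"
)

# Rows are titles, columns are "F" and "M", cells are mean ratings.
# Title/gender pairs with no ratings show up as NaN.
print(mean_ratings)
```

With this toy input, "Big" has no ratings from male viewers and "Clue" has none from female viewers, so those cells are missing rather than zero.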

Other questions

Question 1

In the context of the Bitly data analysis example, what is the primary purpose of using `json.loads` within a list comprehension when reading the data file?

Question 2

When initially trying to extract time zones from the Bitly dataset using `[rec["tz"] for rec in records]`, a `KeyError` occurs. Why does this error happen?

Question 3

In the Bitly data analysis, how is the issue of missing and empty string time zones handled before creating a final visualization with seaborn?

Question 4

How can you decompose the Bitly time zone data into Windows and non-Windows users and then reshape it into a summary table?

Question 5

In the MovieLens 1M dataset analysis, what is the primary reason for merging the `ratings`, `users`, and `movies` DataFrames into a single DataFrame named `data`?

Question 6

How are movies that received at least 250 ratings identified in the MovieLens data analysis?

Question 7

What does the `explode` method accomplish when used on the 'genres' column in the MovieLens dataset?

Question 8

In the US Baby Names analysis, what is the purpose of passing `ignore_index=True` to `pd.concat` when assembling the yearly data files?

Question 9

How is the 'prop' column, representing the proportion of babies with a given name for a specific year and sex, calculated in the US Baby Names dataset?

Question 10

What does the code `prop_cumsum.searchsorted(0.5)` accomplish in the analysis of naming diversity in the US Baby Names dataset?

Question 11

In the USDA Food Database example, how is the complete `nutrients` DataFrame constructed from the nested JSON data?

Question 12

What is the purpose of renaming the 'description' and 'group' columns in both the `info` and `nutrients` DataFrames in the USDA food analysis?

Question 13

How can you find the food with the highest amount of a given nutrient for each nutrient group in the USDA dataset?

Question 14

In the 2012 Federal Election Commission (FEC) data analysis, how is a 'party' column added to the DataFrame?

Question 15

What is the purpose of bucketing the donation amounts using `pd.cut` in the FEC data analysis?

Question 16

After grouping the FEC data by candidate and donation bucket, the code shows `bucket_sums.div(bucket_sums.sum(axis="columns"), axis="index")`. What does this operation calculate?

Question 17

In the Bitly data analysis, what is the value of the 'America/New_York' time zone count after running `tz_counts = frame["tz"].value_counts()`?

Question 18

In the MovieLens dataset, which movie has the largest negative rating difference, indicating it was preferred much more by female viewers than male viewers?

Question 19

According to the US Baby Names analysis, how many of the most popular boy names in 1900 were required to make up 50 percent of the total male births?

Question 20

In the FEC data analysis, which occupation represents the highest total donation amount for the candidate 'Romney, Mitt' among the top 7 occupations listed?

Question 22

What is the main finding of the "last letter revolution" analysis in the US Baby Names dataset?

Question 23

In the Bitly data analysis, to normalize the counts of Windows vs. non-Windows users for each time zone to sum to 1, which pandas method is shown to be more efficient than using `apply`?

Question 24

What is the primary data structure of the `db` object after loading the USDA food database with `json.load`?

Question 25

In the analysis of the name 'Lesley' and its variants in the US Baby Names dataset, what does the final plot generated by `table.plot(style={"M": "k-", "F": "k--"})` show?

Question 26

In the FEC data analysis, how are various spellings and phrasings for occupations like 'INFORMATION REQUESTED' and 'C.E.O.' cleaned up?

Question 27

In the MovieLens analysis, which movie is identified as the most divisively rated, based on the standard deviation of its ratings?

Question 28

What is the total number of rows in the final `names` DataFrame after concatenating all US Baby Names files from 1880 to 2010?

Question 29

In the USDA food database, what is the food group with the highest median Zinc (Zn) value according to the bar plot?

Question 30

How many records (rows) are in the MovieLens 1M dataset's `ratings` table before any merging?

Question 31

Which pandas operation is used to create the `total_births` pivot table in the US Baby Names analysis, showing total births by year and sex?

Question 32

In the Bitly data analysis, what does `agg_counts.sum("columns").argsort()` compute?

Question 33

When analyzing the 2012 FEC data, after bucketing donations, how many donations did 'Obama, Barack' receive in the (10, 100] dollar bucket?

Question 34

What is the data type (Dtype) of the 'contb_receipt_amt' column in the FEC dataset after being loaded by `pd.read_csv`?

Question 35

Which food is identified as having the most 'Alanine' in the USDA food database analysis?

Question 36

In the US Baby Names analysis, what is the proportion of male births in 1910 that had names ending in the letter 'd'?

Question 37

Which Python library and class are used to count time zones efficiently in the Bitly data example, as an alternative to a manual dictionary loop?

Question 38

In the MovieLens dataset analysis, what is the engine specified in the `pd.read_table` function and why might it be necessary?

Question 39

In the analysis of the US Baby Names, what is the purpose of the `get_top1000` function?

Question 40

What does the code `fec[fec["contb_receipt_amt"] > 0]` achieve in the FEC data analysis?

Question 41

In the Bitly data analysis, what is the agent string for the first token split from `frame["a"][1]`?

Question 42

What is the total number of non-null values in the 'manufacturer' column of the USDA food database `info` DataFrame?

Question 43

In the MovieLens analysis, what is the mean rating for the 'Action' genre by viewers in the '18' age group?

Question 44

In the US Baby Names analysis, how many births were there for the name 'Mary' for sex 'F' in the year 1880?

Question 45

What is the total donation amount from the state of California ('CA') to 'Obama, Barack' in the FEC dataset analysis?

Question 46

What is the primary motivation for creating the `fec_mrbo` subset in the FEC data analysis?

Question 47

In the MovieLens analysis, the code `movies["genre"] = movies.pop("genres").str.split("|")` performs two actions. What are they?

Question 48

What is the count of the 'Vegetables and Vegetable Products' food group in the USDA database?

Question 49

In the US Baby Names analysis, how is the trend of the proportion of boys born with names ending in 'd', 'n', and 'y' plotted over time?

Question 50

Which aggregation function is used by default when creating a `pivot_table` in pandas if `aggfunc` is not specified?