Python for Data Analysis: Data Wrangling with pandas, NumPy & Jupyter

Data Cleaning and Preparation

50 questions available

Questions

Question 1

By default, what is the behavior of the `dropna()` method when applied to a pandas DataFrame?

Question 2

What is the effect of passing `how="all"` as an argument to the `data.dropna()` method on a DataFrame?

Question 3

Suppose you want to keep only the rows in a DataFrame that have at least a certain number of non-missing values. Which argument should you use with the `dropna()` method?

Question 4

When using the `fillna()` method on a DataFrame, what is accomplished by passing a dictionary to it?

Question 5

Which method is considered the workhorse function for replacing missing values in a pandas DataFrame or Series?

Question 6

What does the DataFrame method `duplicated()` return?

Question 7

By default, the `duplicated()` and `drop_duplicates()` methods keep the first observed value combination. How can you modify this behavior to keep the last observed combination instead?

Question 8

What is the primary use of the `map` method on a pandas Series in the context of data transformation?

Question 9

Given the pandas Series `data = pd.Series([1., -999., 2., -999., -1000., 3.])`, what is the result of calling `data.replace(-999, np.nan)`?
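
The behavior asked about here can be checked directly. The following is a minimal sketch, assuming pandas and NumPy are installed; the name `result` is introduced only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# replace() returns a new Series; `data` itself is left unchanged.
result = data.replace(-999, np.nan)

print(result)
print(result.isna().tolist())  # which positions became NaN
```

Only exact matches of `-999` are swapped out, so the `-1000.` entry stays as it is.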

Question 10

If you want to replace multiple different values with a single substitute value in a pandas Series, how should you use the `replace` method?

Question 11

How can you create a transformed version of a DataFrame with renamed index and column labels without modifying the original DataFrame?

Question 12

What is the primary function of `pandas.cut`?

Question 13

In the string representation of an interval returned by `pandas.cut`, such as `(18, 25]`, what does the square bracket `]` signify?

Question 14

What is the main difference between the `pandas.cut` and `pandas.qcut` functions?

Question 15

To select all rows in a DataFrame `data` that have a value in any of their columns exceeding 3 in absolute value, which line of code is correct?

Question 16

What does the `numpy.random.permutation()` function produce when called with the length of an axis?

Question 17

How can you select a random subset of 3 rows from a DataFrame `df` without replacement?
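
One possible approach, sketched on a small hypothetical DataFrame, is the `sample` method, which draws without replacement unless told otherwise. The frame contents and the `random_state` seed below are assumptions added only to make the example reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-row frame standing in for `df`.
df = pd.DataFrame(np.arange(5 * 7).reshape((5, 7)))

# Draw 3 distinct rows; replace=False is the default.
subset = df.sample(n=3, random_state=0)

print(subset)
```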

Question 18

What is the purpose of the `pandas.get_dummies` function?

Question 19

If a column in a DataFrame contains strings where multiple categories are separated by a delimiter (e.g., `"Animation|Children's|Comedy"`), which method is specially designed to create dummy variables from it?

Question 20

Why did pandas develop an extension type system, departing from its original reliance on NumPy types?

Question 21

When creating a pandas Series of integers with a missing value using an extension type, what data type should be specified to avoid converting the Series to float64?
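
As a rough illustration of the difference, the sketch below builds the same values twice: once with no dtype given, and once with the nullable integer extension type `pd.Int64Dtype()`. The sample values are made up for the example.

```python
import pandas as pd

# With no dtype given, the missing value forces a float result.
plain = pd.Series([1, 2, 3, None])

# With the nullable integer extension type, the values stay integers
# and the missing entry is shown as <NA>.
nullable = pd.Series([1, 2, 3, None], dtype=pd.Int64Dtype())

print(plain.dtype)
print(nullable.dtype)
print(nullable.isna().tolist())
```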

Question 22

What is the primary difference between Python's built-in `find()` and `index()` string methods?

Question 23

In the context of regular expressions in Python, why is it highly recommended to use the `re.compile()` function?

Question 24

What is the difference between the `re.search()` and `re.match()` methods?

Question 25

In pandas, how do you access array-oriented methods for string operations on a Series that correctly handle missing (NA) values?

Question 26

Given a pandas Series `data` containing email addresses and NA values, what does the method `data.str.findall(pattern, flags=re.IGNORECASE)` return for a row containing an NA value?

Question 27

What is the purpose of the `.str.extract()` method on a pandas Series?

Question 28

In data warehousing, what is the best practice for representing a column with many repeated values, as described in the chapter?

Question 29

When a pandas Series is converted to the 'category' dtype, what two main components does the underlying Categorical object have?

Question 30

If you have an array of integer codes and an array of corresponding category labels from an external source, which constructor should you use to create a `pandas.Categorical` object?

Question 31

How can you make an unordered categorical Series instance ordered in pandas?

Question 32

Why can GroupBy operations be significantly faster when performed on categorical data compared to string data?

Question 33

In a pandas Series `cat_s` with a categorical dtype, how do you access the categorical methods like `set_categories` or `remove_unused_categories`?

Question 34

After filtering a large DataFrame, many of the original categories in a categorical column may no longer be present in the data. Which method can be used to trim these unobserved categories?

Question 35

What is another term for creating dummy variables from categorical data, as mentioned in the section 'Creating dummy variables for modeling'?

Question 36

Consider the Series `s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')`. What will be the output of `pd.get_dummies(s)`?
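
The shape of the result can be inspected with a short sketch like the one below. The name `dummies` is introduced for illustration. The element dtype of the indicator columns appears to differ between pandas versions (boolean in recent releases, small integers in older ones), so the sketch checks only the structure.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# One indicator column per category, one row per original element.
dummies = pd.get_dummies(s)

print(dummies.shape)
print(list(dummies.columns))
# Each row should flag exactly one category.
print(dummies.sum(axis=1).tolist())
```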

Question 37

In a pandas Series created with `pd.Series([1, 2, None], dtype='float64')`, what value is at index 2?
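
A quick way to confirm what ends up at that position is to build the Series and inspect it, as in this minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype='float64')

# The Python None is stored as a floating-point missing value.
print(s[2])
print(np.isnan(s[2]))
print(s.isna().tolist())
```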

Question 38

Given a DataFrame `df`, what is the result of `df.fillna(method="ffill", limit=2)`?
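
The effect of `limit` can be seen on a small hypothetical frame. This sketch uses `df.ffill(limit=2)`, which should behave the same way as the call in the question; it is used here because recent pandas releases deprecate the `method` argument of `fillna`. The column name `x` and the values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a run of three missing values.
df = pd.DataFrame({"x": [1.0, np.nan, np.nan, np.nan, 5.0]})

# Forward fill, but carry each observed value forward at most 2 rows.
filled = df.ffill(limit=2)

print(filled["x"].tolist())
```

The third missing value in the run is left as NaN because the fill limit has already been used up.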

Question 39

What does the `precision` argument in `pd.cut(data, 4, precision=2)` do?

Question 40

Consider the code `data[data.abs() > 3] = np.sign(data) * 3`. What is its effect on the DataFrame `data`?
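
To see what this assignment does, it can be run on a tiny hypothetical DataFrame. The column names and values below are assumptions chosen so that a few entries fall outside the range from -3 to 3.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some values beyond 3 in absolute value.
data = pd.DataFrame({"a": [0.5, -4.2, 3.7],
                     "b": [2.9, 5.0, -3.0]})

# np.sign(data) is -1 or 1 per element, so the right-hand side is -3 or 3;
# only the positions selected by the boolean mask are overwritten.
data[data.abs() > 3] = np.sign(data) * 3

print(data)
```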

Question 41

What is the difference between `data.replace()` and `data.str.replace()` for a pandas Series?

Question 42

In regular expressions, what does the `findall` method return when the pattern contains capturing groups?

Question 43

How can you slice substrings from each element in a pandas Series `data` in a vectorized way?

Question 44

Consider the code `pd.get_dummies(pd.cut(values, bins))`. What is the useful application of this combination of functions?

Question 45

If you have a pandas Series `cat_s2` with 5 defined categories ('a' through 'e') but the data only contains 'a', 'b', 'c', 'd', what will `cat_s2.value_counts()` show for category 'e'?
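
A setup like the one described can be reproduced as follows. This is a sketch: the way `cat_s2` is constructed here (via `astype` with a `CategoricalDtype`) is an assumption, since the question does not say how the Series was built.

```python
import pandas as pd

# Data contains only 'a' through 'd', but five categories are declared.
values = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s2 = values.astype(pd.CategoricalDtype(['a', 'b', 'c', 'd', 'e']))

counts = cat_s2.value_counts()

print(counts)
print(counts['e'])  # count reported for the unobserved category
```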

Question 46

What is the return type of the `.codes` attribute of a pandas Categorical object?

Question 47

Given `ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]` and `bins = [18, 25, 35, 60, 100]`, how many values fall into the `(18, 25]` bin when `pd.cut(ages, bins)` is called?
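
The bin membership can be tallied with a short sketch such as the one below. Wrapping the result in a Series and calling `value_counts(sort=False)` is one convenient way to count per bin in bin order; the names `cats` and `counts` are introduced for illustration.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# By default each bin is open on the left and closed on the right,
# so 25 lands in (18, 25] rather than (25, 35].
cats = pd.cut(ages, bins)

# Count per bin, kept in bin order rather than sorted by frequency.
counts = pd.Series(cats).value_counts(sort=False)

print(counts)
```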

Question 48

Which pandas method is specifically designed to perform a vectorized set membership check?

Question 49

What does the pandas `value_counts()` method return?

Question 50

How can you get an index array from an array of possibly non-distinct values into another array of distinct values, which is helpful for data alignment?
