Which of the following best summarizes why ML was historically limited before recent advances, according to the chapter?

Correct answer: Insufficient data and computing power limited ML performance until Big Data and faster processors became available

Explanation

ML benefits from both larger datasets and advances in algorithms and hardware, which together have made it more effective.

Other questions

Question 1

Which set of characteristics is the canonical definition of Big Data as described in the chapter?

Question 2

What does the fourth V, veracity, refer to when working with Big Data?

Question 3

Which of the following is an example of alternative data generated by business processes?

Question 4

Which type of alternative data is most likely to be unstructured and require NLP for processing?

Question 5

An analyst wants to detect consumer sentiment toward a new product using millions of tweets. Which method is most appropriate according to the chapter?

Question 6

Which ML category would you use to group companies into peer clusters without pre-labeled categories?

Question 7

Which description best matches supervised learning as presented in the chapter?

Question 8

Deep learning typically differs from classical ML primarily by:

Question 9

What is overfitting in ML, according to the chapter?

Question 10

An ML practitioner splits data into training, validation, and test sets. What is the correct primary use of the validation set?

Question 11

Which programming language is noted in the chapter as being open-source, beginner-friendly, and commonly used for ML and fintech applications?

Question 12

Which database type is most appropriate for large-scale unstructured data according to the chapter?

Question 13

Which data processing stage focuses primarily on detecting bad data, missing values, and correcting errors before analysis?

Question 14

Which visualization technique is particularly useful for showing word frequency in a corpus as emphasized in the chapter?

Question 15

Which of the following is an ethical or legal concern about alternative data mentioned in the chapter?

Question 16

An analyst finds a model with excellent fit on the training set but poor predictive accuracy on new data. According to the chapter, what likely happened?

Question 17

Which application mentioned in the chapter uses image recognition on satellite data to provide investment insights?

Question 18

Which of the following best captures a data scientist's capture-stage concern for an automated trading application?

Question 19

Which programming language noted in the chapter is traditionally favored for advanced statistical analysis and has many packages for econometrics and optimization?

Question 20

According to the chapter, which data storage choice is appropriate when you need an embedded database for a mobile app with structured data?

Question 21

Which one of these is NOT a common step in the data science pipeline as outlined in the chapter?

Question 22

If a dataset contains many missing values and some extreme outliers, the chapter recommends addressing these issues during which stage?

Question 23

An investment team uses NLP to score tone in central bank speeches. Which of the following is a valid application described in the chapter?

Question 24

Which of the following best reflects the chapter's treatment of ML model opacity?

Question 25

What is a primary advantage of using alternative datasets (e.g., credit-card or satellite data) in investment models as described in the chapter?

Question 26

Which of the following is an example of sensor-generated data highlighted in the chapter?

Question 27

Consider an ML model trained on 10,000 labeled examples that achieves 95 percent accuracy on training data but only 60 percent on a holdout test set. Which remedial action aligns with chapter guidance?

Question 28

Which practical visualization method from the chapter helps explore residual randomness for a regression model?

Question 30

A small hedge fund can only obtain 200 labeled examples for training a supervised ML classifier. Based on chapter guidance, which statement is most accurate?

Question 31

Which of the following best describes corporate 'exhaust' as defined in the chapter?

Question 32

Which ML model-evaluation practice is explicitly recommended in the chapter to assess generalization performance?

Question 33

Which of the following is an example of a semistructured data format listed in the chapter?

Question 34

Which approach from the chapter would be most useful to reduce dimension or visualize relationships across many numeric features?

Question 35

A dataset for an investment model contains items in several different languages. The chapter suggests which of the following regarding NLP application?

Question 36

Which of the following is NOT a stated advantage of Python in the chapter?

Question 37

Which of the following best describes the chapter's recommendation on comparing model fits between different functional forms (e.g., lin-lin vs log-lin)?

Question 38

Which of the following actions would the chapter suggest to evaluate whether alternative data provide predictive value for returns?

Question 39

A modeler wants to store large volumes of streaming market tick data and query it in real time. According to the chapter, which combination is most relevant to consider?

Question 40

If a researcher uses tag clouds to summarize an earnings call transcript, which insight does the chapter suggest is most directly provided?

Question 41

Quantitative question: A dataset comprises 2 million daily sensor records and 50,000 labeled training examples for supervised learning. According to chapter guidance, which strategy is most consistent with recommended practice?

Question 42

Quantitative question: A tag cloud of a 10,000-word document shows the top word occurs 400 times, the second 200 times, and the third 100 times. What percent of words do the top three words represent?

Question 43

Quantitative question: A classification model yields 90 percent accuracy on training and 75 percent accuracy on validation. If the model was simplified and validation accuracy rose to 80 percent but training dropped to 85 percent, which change aligns with chapter recommendations?

Question 44

Quantitative question: An NLP sentiment score is computed as (positive_count - negative_count) / total_words for an earnings call transcript of 2,000 words. If positive_count = 120 and negative_count = 80, what is the sentiment score?

Question 45

Quantitative question: A firm collects 10 terabytes of IoT sensor data per month. If a petabyte equals 1000 terabytes, how many months of data are required to reach one petabyte?

Question 46

Quantitative question: A sentiment model gives an average predicted alpha contribution of 0.12 percent per quarter with an estimated standard error of 0.04 percent. What is the approximate 95 percent t-based confidence interval for the quarterly alpha contribution (use t ~ 2)?

Question 47

Quantitative question: A dataset contains tweets with 10 million total words. If a keyword appears with frequency 50,000, what is its approximate frequency per 10,000 words?

Question 48

Quantitative question: You conduct A/B testing using click-through rates. Group A has 2,400 clicks from 80,000 impressions; Group B has 2,640 clicks from 80,000 impressions. Which group has the higher CTR and by how many basis points (1 basis point = 0.01 percent)?

Question 49

Quantitative question: A predictive model based on satellite parking-lot counts yields monthly signals with a sample mean of 0.8 percent expected revenue uplift and a sample standard deviation of 1.6 percent across 36 months. What is the standard error of the mean uplift?

Question 50

Quantitative question: An analyst stores high-frequency trade data producing 120 GB per day. How many days of data will fill a 10 TB storage device (1 TB = 1000 GB) approximately?
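
Worked arithmetic

The quantitative questions above (42 and 44 to 50) each reduce to a one-line calculation. The sketch below is one way to check that arithmetic in Python, using only the numbers stated in the questions. The function names are hypothetical helpers chosen for illustration; they do not come from the chapter, and the printed values are not an official answer key.

```python
import math


def share_of_total(counts, total):
    """Fraction of `total` accounted for by the given counts (Question 42)."""
    return sum(counts) / total


def sentiment_score(positive_count, negative_count, total_words):
    """(positive_count - negative_count) / total_words, as stated in Question 44."""
    return (positive_count - negative_count) / total_words


def periods_to_fill(capacity, rate_per_period):
    """Number of periods needed to accumulate `capacity` (Questions 45 and 50).

    Both arguments must be in the same unit (e.g. TB and TB/month).
    """
    return capacity / rate_per_period


def approx_confidence_interval(estimate, std_err, t=2.0):
    """Approximate interval: estimate +/- t * standard error (Question 46)."""
    half_width = t * std_err
    return estimate - half_width, estimate + half_width


def frequency_per(count, total_words, per=10_000):
    """Occurrences of a keyword per `per` words (Question 47)."""
    return count / total_words * per


def ctr_gap_basis_points(clicks_a, clicks_b, impressions):
    """CTR of group B minus CTR of group A, in basis points (Question 48).

    Assumes both groups have the same number of impressions.
    A difference of 1.0 in the CTR fraction equals 10,000 basis points.
    """
    return (clicks_b - clicks_a) / impressions * 10_000


def standard_error_of_mean(sample_std, n):
    """Sample standard deviation divided by sqrt(n) (Question 49)."""
    return sample_std / math.sqrt(n)


# Inputs below are copied from the question text.
print("Q42 top-three share:", share_of_total([400, 200, 100], 10_000))
print("Q44 sentiment score:", sentiment_score(120, 80, 2_000))
print("Q45 months to 1 PB:", periods_to_fill(1_000, 10))
print("Q46 interval (pct):", approx_confidence_interval(0.12, 0.04))
print("Q47 per 10,000 words:", frequency_per(50_000, 10_000_000))
print("Q48 CTR gap (bp):", ctr_gap_basis_points(2_400, 2_640, 80_000))
print("Q49 standard error (pct):", standard_error_of_mean(1.6, 36))
print("Q50 days to fill 10 TB:", periods_to_fill(10_000, 120))
```

Results may show small floating-point rounding, so compare them to the answer choices after rounding to a sensible number of decimals.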