Learning Module 11: Introduction to Big Data Techniques
50 questions available
Key Points
- Fintech enables new data-driven investment processes.
- Big Data is defined by volume, velocity, and variety (plus veracity, the trustworthiness of the data).
- Alternative data sources complement traditional financial datasets.
- Main alternative data sources: individuals, business processes, sensors/IoT.
- Data quality and selection issues are central challenges.
Key Points
- ML requires splitting datasets into training, validation, and test subsets.
- Supervised learning maps inputs to labeled outputs; unsupervised learning finds structure in unlabeled data.
- Deep learning uses multi-layer neural networks for complex pattern extraction.
- Overfitting and underfitting are core model-performance risks.
- NLP transforms unstructured text/voice into structured indicators for investment use.
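The three-way split named in the first key point above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's method: the function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are all assumptions chosen for the example. For time-ordered financial data, a chronological split is often preferred over a random shuffle.

```python
import random


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test subsets.

    The fractions are illustrative assumptions; the remainder after the
    training and validation slices becomes the test set.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty test set")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]                    # used to fit the model
    validation = shuffled[n_train:n_train + n_val]  # used to tune the model
    test = shuffled[n_train + n_val:]             # held out for the final check
    return train, validation, test


train, validation, test = split_dataset(range(1000))
print(len(train), len(validation), len(test))
```

The test subset is touched only once, after tuning on the validation subset is finished, so that it gives an honest estimate of out-of-sample performance.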
Key Points
- Five data processing tasks: capture, curation, storage, search, transfer.
- Visualization techniques help interpret complex and unstructured data.
- Common programming languages: Python, R, Java, C/C++, Excel VBA.
- Databases: SQL, SQLite, and NoSQL, chosen according to data structure and latency needs.
- Applications in investment management are wide-ranging but require legal/ethical diligence.
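To make the database key point above concrete, here is a minimal sketch of an embedded SQLite store for structured data, using Python's standard-library `sqlite3` module. The table name `prices`, the tickers, and the values are hypothetical and exist only for illustration.

```python
import sqlite3

# An in-memory database; pass a file path instead to persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (ticker TEXT, trade_date TEXT, close REAL)")

# Hypothetical sample rows, not real market data.
rows = [
    ("AAA", "2024-01-02", 10.0),
    ("AAA", "2024-01-03", 11.0),
    ("BBB", "2024-01-02", 20.0),
]
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)

# A structured query: average closing price per ticker.
averages = conn.execute(
    "SELECT ticker, AVG(close) FROM prices GROUP BY ticker ORDER BY ticker"
).fetchall()
conn.close()

print(averages)
```

SQLite runs inside the application process with no separate server, which is why it suits embedded uses with structured data.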
Questions
Which set of characteristics is the canonical definition of Big Data as described in the chapter?
What does the fourth V, veracity, refer to when working with Big Data?
Which of the following is an example of alternative data generated by business processes?
Which type of alternative data is most likely to be unstructured and require NLP for processing?
An analyst wants to detect consumer sentiment toward a new product using millions of tweets. Which method is most appropriate according to the chapter?
Which ML category would you use to group companies into peer clusters without pre-labeled categories?
Which description best matches supervised learning as presented in the chapter?
Deep learning typically differs from classical ML primarily by:
What is overfitting in ML, according to the chapter?
An ML practitioner splits data into training, validation, and test sets. What is the correct primary use of the validation set?
Which programming language is noted in the chapter as being open-source, beginner-friendly, and commonly used for ML and fintech applications?
Which database type is most appropriate for large-scale unstructured data according to the chapter?
Which data processing stage focuses primarily on detecting bad data, missing values, and correcting errors before analysis?
Which visualization technique is particularly useful for showing word frequency in a corpus as emphasized in the chapter?
Which of the following is an ethical or legal concern about alternative data mentioned in the chapter?
An analyst finds a model with excellent fit on the training set but poor predictive accuracy on new data. According to the chapter, what likely happened?
Which application mentioned in the chapter uses image recognition on satellite data to provide investment insights?
Which of the following best captures a data scientist's capture-stage concern for an automated trading application?
Which programming language noted in the chapter is traditionally favored for advanced statistical analysis and has many packages for econometrics and optimization?
According to the chapter, which data storage choice is appropriate when you need an embedded database for a mobile app with structured data?
Which one of these is NOT a common step in the data science pipeline as outlined in the chapter?
If a dataset contains many missing values and some extreme outliers, the chapter recommends addressing these issues during which stage?
An investment team uses NLP to score tone in central bank speeches. Which of the following is a valid application described in the chapter?
Which of the following best reflects the chapter's treatment of ML model opacity?
What is a primary advantage of using alternative datasets (e.g., credit-card or satellite data) in investment models as described in the chapter?
Which of the following is an example of sensor-generated data highlighted in the chapter?
Consider an ML model trained on 10,000 labeled examples that achieves 95 percent accuracy on training data but only 60 percent on a holdout test set. Which remedial action aligns with chapter guidance?
Which practical visualization method from the chapter helps explore residual randomness for a regression model?
Which of the following best summarizes why ML historically was limited before recent advances, according to the chapter?
A small hedge fund can only obtain 200 labeled examples for training a supervised ML classifier. Based on chapter guidance, which statement is most accurate?
Which of the following best describes corporate 'exhaust' as defined in the chapter?
Which ML model-evaluation practice is explicitly recommended in the chapter to assess generalization performance?
Which of the following is an example of a semistructured data format listed in the chapter?
Which approach from the chapter would be most useful to reduce dimension or visualize relationships across many numeric features?
A dataset for an investment model contains items in several different languages. The chapter suggests which of the following regarding NLP application?
Which of the following is NOT a stated advantage of Python in the chapter?
Which of the following best describes the chapter's recommendation on comparing model fits between different functional forms (e.g., lin-lin vs log-lin)?
Which of the following actions would the chapter suggest to evaluate whether alternative data provide predictive value for returns?
A modeler wants to store large volumes of streaming market tick data and query it in real time. According to the chapter, which combination is most relevant to consider?
If a researcher uses tag clouds to summarize an earnings call transcript, which insight does the chapter suggest is most directly provided?
Quantitative question: A dataset comprises 2 million daily sensor records and 50,000 labeled training examples for supervised learning. According to chapter guidance, which strategy is most consistent with recommended practice?
Quantitative question: A tag cloud of a 10,000-word document shows the top word occurs 400 times, the second 200 times, and the third 100 times. What percent of words do the top three words represent?
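As a check on the arithmetic in the question above, the share is the sum of the three counts divided by the document length. A minimal Python sketch, with illustrative variable names:

```python
# Counts of the three most frequent words, taken from the question.
top_counts = [400, 200, 100]
total_words = 10_000

# Share of all words accounted for by the top three, in percent.
share_pct = 100 * sum(top_counts) / total_words
print(f"{share_pct:.1f}%")
```

This gives 7 percent of the document's words.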
Quantitative question: A classification model yields 90 percent accuracy on training and 75 percent accuracy on validation. If the model was simplified and validation accuracy rose to 80 percent but training dropped to 85 percent, which change aligns with chapter recommendations?
Quantitative question: An NLP sentiment score is computed as (positive_count - negative_count) / total_words for an earnings call transcript of 2,000 words. If positive_count = 120 and negative_count = 80, what is the sentiment score?
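The formula in the question above translates directly into code. A minimal sketch follows; the function name `sentiment_score` is an assumption made for this example, while the formula itself comes from the question.

```python
def sentiment_score(positive_count, negative_count, total_words):
    """Net sentiment per word: (positive - negative) / total words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return (positive_count - negative_count) / total_words


score = sentiment_score(positive_count=120, negative_count=80, total_words=2_000)
print(f"{score:.3f}")
```

With the inputs from the question, the score is 40 / 2,000 = 0.02, a mildly positive tone.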
Quantitative question: A firm collects 10 terabytes of IoT sensor data per month. If a petabyte equals 1000 terabytes, how many months of data are required to reach one petabyte?
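The unit conversion in the question above can be checked with a short Python sketch. Variable names are illustrative, and the ceiling is used so that a non-integer result would round up to the month in which the threshold is crossed.

```python
import math

tb_per_month = 10        # collection rate from the question
tb_per_petabyte = 1_000  # conversion stated in the question

# Months needed for cumulative data to reach one petabyte.
months_needed = math.ceil(tb_per_petabyte / tb_per_month)
print(months_needed)
```

At 10 TB per month, one petabyte is reached after 100 months.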
Quantitative question: A sentiment model gives an average predicted alpha contribution of 0.12 percent per quarter with an estimated standard error of 0.04 percent. What is the approximate 95 percent t-based confidence interval for the quarterly alpha contribution (use t ~ 2)?
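The interval in the question above is the estimate plus or minus the critical value times the standard error. A minimal Python sketch, using the question's approximation t ~ 2 and illustrative variable names:

```python
mean_alpha_pct = 0.12  # average quarterly alpha contribution, in percent
std_error_pct = 0.04   # standard error of that estimate, in percent
t_value = 2.0          # approximate 95 percent critical value from the question

margin = t_value * std_error_pct
lower = mean_alpha_pct - margin
upper = mean_alpha_pct + margin
print(f"{lower:.2f}% to {upper:.2f}%")
```

The interval runs from about 0.04 percent to 0.20 percent per quarter. Because it excludes zero, the estimated contribution is statistically distinguishable from zero at roughly the 5 percent level under this approximation.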
Quantitative question: A dataset contains tweets with 10 million total words. If a keyword appears with frequency 50,000, what is its approximate frequency per 10,000 words?
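Normalizing a raw count to a rate per 10,000 words, as the question above asks, is a single rescaling step. A minimal Python sketch with illustrative variable names:

```python
keyword_count = 50_000
total_words = 10_000_000
per_words = 10_000  # normalization base requested in the question

# Occurrences of the keyword per 10,000 words of text.
rate = keyword_count * per_words / total_words
print(rate)
```

The keyword appears about 50 times per 10,000 words, which is 0.5 percent of all words.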
Quantitative question: You conduct A/B testing using click-through rates. Group A has 2,400 clicks from 80,000 impressions; Group B has 2,640 clicks from 80,000 impressions. Which group has the higher CTR and by how many basis points (1 basis point = 0.01 percent)?
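The comparison in the question above can be sketched as follows. The helper name `ctr` is an assumption for this example; the basis-point conversion multiplies a difference in fractions by 10,000, since 1 basis point is 0.01 percent.

```python
def ctr(clicks, impressions):
    """Click-through rate as a fraction of impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


ctr_a = ctr(2_400, 80_000)
ctr_b = ctr(2_640, 80_000)

# Difference expressed in basis points (1 bp = 0.0001 as a fraction).
diff_bps = (ctr_b - ctr_a) * 10_000
print(f"A: {ctr_a:.2%}, B: {ctr_b:.2%}, difference: {diff_bps:.0f} bps")
```

Group A's CTR is 3.00 percent and Group B's is 3.30 percent, so Group B is higher by about 30 basis points. This sketch only compares the point estimates; it does not test whether the difference is statistically significant.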
Quantitative question: A predictive model based on satellite parking-lot counts yields monthly signals with a sample mean of 0.8 percent expected revenue uplift and sample standard deviation 1.6 percent across 36 months. What is the standard error of the mean uplift?
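The standard error of a sample mean is the sample standard deviation divided by the square root of the number of observations. A minimal Python sketch for the question above, with illustrative variable names:

```python
import math

sample_std_pct = 1.6  # sample standard deviation of monthly uplift, in percent
n_months = 36         # number of monthly observations

# Standard error of the mean: s / sqrt(n).
std_error_pct = sample_std_pct / math.sqrt(n_months)
print(f"{std_error_pct:.4f}%")
```

With 36 observations the divisor is 6, so the standard error is roughly 0.27 percent. This assumes the monthly signals are independent, which the question does not state explicitly.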
Quantitative question: An analyst stores high-frequency trade data producing 120 GB per day. How many days of data will fill a 10 TB storage device (1 TB = 1000 GB) approximately?
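The capacity estimate in the question above is a unit conversion followed by a division. A minimal Python sketch with illustrative variable names:

```python
gb_per_day = 120
capacity_tb = 10
gb_per_tb = 1_000  # decimal convention stated in the question

capacity_gb = capacity_tb * gb_per_tb
days_to_fill = capacity_gb / gb_per_day    # fractional days until the device is full
full_days_stored = capacity_gb // gb_per_day  # whole days that fit completely
print(f"{days_to_fill:.1f} days, {full_days_stored} full days")
```

The device fills after roughly 83 days of data.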