Learning Module 11: Introduction to Big Data Techniques

50 questions available

Fintech and Big Data
Fintech in investment analysis refers to technology-driven innovations that affect the collection and analysis of financial data, including Big Data, AI, and ML. Big Data is characterized by volume (massive amounts of data), velocity (fast or real-time data streams), and variety (structured, semi-structured, unstructured formats). When used for inference or prediction, veracity (data quality and credibility) is also crucial. Traditional data sources include market prices, company filings, and government statistics; alternative data sources include social media, web logs, corporate exhaust, credit-card and transaction data, satellite imagery, sensors, and IoT devices. Alternative data sources are often classified by origin: data generated by individuals (social posts, search logs), business processes (point-of-sale, corporate exhaust), and sensors (satellites, RFID, smart devices). Big Data challenges include selection bias, missingness, outliers, data cleaning complexity, and suitability of datasets for intended analyses.

Key Points

  • Fintech enables new data-driven investment processes.
  • Big Data defined by volume, velocity, variety (and veracity for trustworthiness).
  • Alternative data sources complement traditional financial datasets.
  • Main alternative data sources: individuals, business processes, sensors/IoT.
  • Data quality and selection issues are central challenges.
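
The data-quality challenges listed above (missingness, outliers) can be illustrated with a minimal screening sketch. This is a hypothetical example, not a method from the chapter: the dataset, the z-score cutoff of 2.0, and the `screen` function are all illustrative assumptions.

```python
# Hypothetical sketch: screening a small dataset for missing values and
# outliers (two of the Big Data challenges noted above) before analysis.
from statistics import mean, stdev

def screen(values, z_cutoff=2.0):
    """Count missing entries and flag values beyond z_cutoff standard deviations."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    mu, sigma = mean(present), stdev(present)
    outliers = [v for v in present if abs(v - mu) > z_cutoff * sigma]
    return {"missing": missing, "outliers": outliers}

daily_sales = [102, 98, None, 105, 97, 101, 480, None, 99]  # made-up data
print(screen(daily_sales))  # {'missing': 2, 'outliers': [480]}
```

In practice the appropriate treatment (imputation, winsorization, exclusion) depends on the intended analysis, which is why the chapter stresses dataset suitability.
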
AI, Machine Learning, and NLP
Artificial intelligence (AI) systems perform tasks that traditionally required human intelligence; machine learning (ML) automates pattern discovery from data without assuming a parametric probability distribution. ML workflows split data into training, validation, and test sets. Main ML categories are supervised learning (labeled inputs and outputs), unsupervised learning (discovering structure without labeled outputs), and deep learning (neural networks with many layers for multistage non-linear processing). Risks and limitations of ML include overfitting (model learns noise as signal), underfitting (model too simple to capture true patterns), dependency on large and clean datasets, and opacity of complex models. Natural language processing (NLP) is a text analytics application combining linguistics, AI, and statistics to extract meaning from text and voice; common NLP tasks include sentiment analysis, topic detection, translation, and speech recognition. NLP is widely used to analyze company filings, earnings call transcripts, policy speeches, and social media to detect sentiment shifts and trends.
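
The sentiment analysis described above can be sketched as a simple bag-of-words score: net positive minus negative word counts, scaled by document length. The word lists and transcript below are illustrative assumptions, not from the chapter.

```python
# Hypothetical sketch of bag-of-words sentiment scoring:
# (positive_count - negative_count) / total_words for a document.
POSITIVE = {"growth", "strong", "beat", "improve", "record"}  # illustrative list
NEGATIVE = {"decline", "weak", "miss", "risk", "loss"}        # illustrative list

def sentiment_score(text):
    """Return the net sentiment word count scaled by document length."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

transcript = "strong revenue growth this quarter despite currency risk"
print(sentiment_score(transcript))  # (2 - 1) / 8 = 0.125
```

Production NLP pipelines add tokenization, negation handling, and domain-specific lexicons, but the scaled-net-count idea is the same.
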

Key Points

  • ML requires splitting datasets into training, validation, and test subsets.
  • Supervised learning models map inputs to labeled outputs; unsupervised finds structure.
  • Deep learning uses multi-layer neural networks for complex pattern extraction.
  • Overfitting and underfitting are core model-performance risks.
  • NLP transforms unstructured text/voice into structured indicators for investment use.
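
The train/validation/test split and the overfitting risk described above can be sketched as follows. The 60/20/20 proportions and the 10-point accuracy-gap threshold are illustrative assumptions, not figures from the chapter.

```python
# Hypothetical sketch: partitioning data into training, validation, and test
# subsets, plus a crude overfitting check (large train-validation accuracy gap).
import random

def split(data, train=0.6, val=0.2, seed=42):
    """Shuffle and partition data into training, validation, and test subsets."""
    rows = list(data)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    n = len(rows)
    n_train, n_val = int(n * train), int(n * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

train_set, val_set, test_set = split(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 600 200 200

def overfitting_gap(train_acc, val_acc, threshold=0.10):
    """Flag a model whose training accuracy far exceeds validation accuracy."""
    return (train_acc - val_acc) > threshold

print(overfitting_gap(0.95, 0.60))  # True: the model likely learned noise
```

The validation set tunes model choices; the test set is held back for a final, unbiased estimate of generalization performance.
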
Data Science Workflow, Tools, and Applications
Data science is an interdisciplinary approach leveraging statistics, computer science, and domain knowledge to extract insights from Big Data. Core data processing methods include capture (collecting and formatting data), curation (cleaning and preparing data), storage (choosing appropriate databases and latency characteristics), search (querying data efficiently), and transfer (feeding data to analysis pipelines). Data visualization methods help interpret both structured and unstructured data and include interactive 3D visuals, heat maps, tree diagrams, network graphs, tag clouds, and mind maps.

Programming languages commonly used in data science include Python (general-purpose, beginner-friendly, strong ML libraries), R (statistical analysis), Java (portable applications), C/C++ (high-performance computing), and Excel VBA (automation). Databases include SQL (structured data), SQLite (embedded structured databases for apps), and NoSQL (unstructured or flexible-schema data).

Applications of Big Data and ML in investment management include algorithmic trading, risk analysis, trade execution optimization, asset selection enhancement, and alternative-data-driven alpha generation. Practical concerns include legal and ethical issues surrounding alternative data (privacy, scraping, consent) and the need to evaluate model performance using out-of-sample testing, validation sets, and careful residual analysis. Successful deployment requires domain knowledge to select suitable data, rigorous data cleaning, proper model selection (avoiding over- and underfitting), and monitoring of model behavior after deployment.

Key Points

  • Five data processing tasks: capture, curation, storage, search, transfer.
  • Visualization techniques help interpret complex and unstructured data.
  • Common programming languages: Python, R, Java, C/C++, Excel VBA.
  • Databases: SQL, SQLite, NoSQL depending on data structure and latency needs.
  • Applications in investment management are wide-ranging but require legal/ethical diligence.
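
The role of SQLite as an embedded database for structured data can be sketched with Python's built-in `sqlite3` module. The trades table and its contents are made-up illustrations.

```python
# Hypothetical sketch: SQLite as an embedded structured database,
# using Python's standard-library sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for illustration
conn.execute("CREATE TABLE trades (symbol TEXT, qty INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("AAPL", 100, 190.5), ("MSFT", 50, 410.0), ("AAPL", 25, 191.2)],  # made-up rows
)
# Query the structured data: total quantity per symbol
totals = dict(
    conn.execute("SELECT symbol, SUM(qty) FROM trades GROUP BY symbol ORDER BY symbol")
)
print(totals)  # {'AAPL': 125, 'MSFT': 50}
conn.close()
```

An embedded database like this suits applications with structured, moderate-volume data; large-scale unstructured data would instead point toward a NoSQL store, as noted above.
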

Questions

Question 1

Which set of characteristics is the canonical definition of Big Data as described in the chapter?

Question 2

What does the fourth V, veracity, refer to when working with Big Data?

Question 3

Which of the following is an example of alternative data generated by business processes?

Question 4

Which type of alternative data is most likely to be unstructured and require NLP for processing?

Question 5

An analyst wants to detect consumer sentiment toward a new product using millions of tweets. Which method is most appropriate according to the chapter?

Question 6

Which ML category would you use to group companies into peer clusters without pre-labeled categories?

Question 7

Which description best matches supervised learning as presented in the chapter?

Question 8

Deep learning differs from classical ML primarily by:

Question 9

What is overfitting in ML, according to the chapter?

Question 10

An ML practitioner splits data into training, validation, and test sets. What is the correct primary use of the validation set?

Question 11

Which programming language is noted in the chapter as being open-source, beginner-friendly, and commonly used for ML and fintech applications?

Question 12

Which database type is most appropriate for large-scale unstructured data according to the chapter?

Question 13

Which data processing stage focuses primarily on detecting bad data, missing values, and correcting errors before analysis?

Question 14

Which visualization technique is particularly useful for showing word frequency in a corpus as emphasized in the chapter?

Question 15

Which of the following is an ethical or legal concern about alternative data mentioned in the chapter?

Question 16

An analyst finds a model with excellent fit on the training set but poor predictive accuracy on new data. According to the chapter, what likely happened?

Question 17

Which application mentioned in the chapter uses image recognition on satellite data to provide investment insights?

Question 18

Which of the following best captures a data scientist's capture-stage concern for an automated trading application?

Question 19

Which programming language noted in the chapter is traditionally favored for advanced statistical analysis and has many packages for econometrics and optimization?

Question 20

According to the chapter, which data storage choice is appropriate when you need an embedded database for a mobile app with structured data?

Question 21

Which one of these is NOT a common step in the data science pipeline as outlined in the chapter?

Question 22

If a dataset contains many missing values and some extreme outliers, the chapter recommends addressing these issues during which stage?

Question 23

An investment team uses NLP to score tone in central bank speeches. Which of the following is a valid application described in the chapter?

Question 24

Which of the following best reflects the chapter's treatment of ML model opacity?

Question 25

What is a primary advantage of using alternative datasets (e.g., credit-card or satellite data) in investment models as described in the chapter?

Question 26

Which of the following is an example of sensor-generated data highlighted in the chapter?

Question 27

Consider an ML model trained on 10,000 labeled examples that achieves 95 percent accuracy on training data but only 60 percent on a holdout test set. Which remedial action aligns with chapter guidance?

Question 28

Which practical visualization method from the chapter helps explore residual randomness for a regression model?

Question 29

Which of the following best summarizes why ML historically was limited before recent advances, according to the chapter?

Question 30

A small hedge fund can only obtain 200 labeled examples for training a supervised ML classifier. Based on chapter guidance, which statement is most accurate?

Question 31

Which of the following best describes corporate 'exhaust' as defined in the chapter?

Question 32

Which ML model-evaluation practice is explicitly recommended in the chapter to assess generalization performance?

Question 33

Which of the following is an example of a semi-structured data format listed in the chapter?

Question 34

Which approach from the chapter would be most useful to reduce dimension or visualize relationships across many numeric features?

Question 35

A dataset for an investment model contains items in several different languages. The chapter suggests which of the following regarding NLP application?

Question 36

Which of the following is NOT a stated advantage of Python in the chapter?

Question 37

Which of the following best describes the chapter's recommendation on comparing model fits between different functional forms (e.g., lin-lin vs log-lin)?

Question 38

Which of the following actions would the chapter suggest to evaluate whether alternative data provide predictive value for returns?

Question 39

A modeler wants to store large volumes of streaming market tick data and query it in real time. According to the chapter, which combination is most relevant to consider?

Question 40

If a researcher uses tag clouds to summarize an earnings call transcript, which insight does the chapter suggest is most directly provided?

Question 41

Quantitative question: A dataset comprises 2 million daily sensor records and 50,000 labeled training examples for supervised learning. According to chapter guidance, which strategy is most consistent with recommended practice?

Question 42

Quantitative question: A tag cloud of a 10,000-word document shows the top word occurs 400 times, the second 200 times, and the third 100 times. What percent of words do the top three words represent?

Question 43

Quantitative question: A classification model yields 90 percent accuracy on training and 75 percent accuracy on validation. If the model was simplified and validation accuracy rose to 80 percent but training dropped to 85 percent, which change aligns with chapter recommendations?

Question 44

Quantitative question: An NLP sentiment score is computed as (positive_count - negative_count) / total_words for an earnings call transcript of 2,000 words. If positive_count = 120 and negative_count = 80, what is the sentiment score?

Question 45

Quantitative question: A firm collects 10 terabytes of IoT sensor data per month. If a petabyte equals 1000 terabytes, how many months of data are required to reach one petabyte?

Question 46

Quantitative question: A sentiment model gives an average predicted alpha contribution of 0.12 percent per quarter with an estimated standard error of 0.04 percent. What is the 95 percent approximate t-based confidence interval for the quarterly alpha contribution (use t ~ 2)?

Question 47

Quantitative question: A dataset contains tweets with 10 million total words. If a keyword appears with frequency 50,000, what is its approximate frequency per 10,000 words?

Question 48

Quantitative question: You conduct A/B testing using click-through rates. Group A has 2,400 clicks from 80,000 impressions; Group B has 2,640 clicks from 80,000 impressions. Which group has the higher CTR and by how many basis points (1 basis point = 0.01 percent)?

Question 49

Quantitative question: A predictive model based on satellite parking-lot counts yields monthly signals with a sample mean of 0.8 percent expected revenue uplift and sample standard deviation 1.6 percent across 36 months. What is the standard error of the mean uplift?

Question 50

Quantitative question: An analyst stores high-frequency trade data producing 120 GB per day. How many days of data will fill a 10 TB storage device (1 TB = 1000 GB) approximately?
