Python for Data Science

VTU - MCS104A

Level 1 Questions (5-6 Marks)

Module 1
  1. Using Python, write a snippet to:

    • Store the heights of 5 students in a list

    • Convert it to a NumPy array

    • Print the mean height.
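
    A sample answer (a minimal sketch; the height values are hypothetical):

      import numpy as np

      heights = [1.62, 1.71, 1.55, 1.80, 1.68]    # heights of 5 students in metres (hypothetical)
      heights_arr = np.array(heights)             # convert the list to a NumPy array
      print("Mean height:", heights_arr.mean())   # mean height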

  2. Given the following Python code:

    import numpy as np
    data = np.array([10, 20, 30, 40, 50])

    print(np.mean(data), np.std(data))

    What do the two printed values represent in the context of statistical inference?

  3. A company records all user clicks on its website for one day and stores them in a file.

    • Identify the population and a possible sample for a data‑science study.

    • State whether “time spent on page (in seconds)” is discrete or continuous data.

  4. Classify each as nominal, ordinal, interval, or ratio:
    a) Temperature in °C recorded every hour
    b) Course grades A/B/C/D/F
    c) Number of products purchased in a transaction
    d) Country of residence of a customer

  5. In Python with NumPy, generate 1000 samples from a standard Normal distribution and print the approximate mean and standard deviation.
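
    One possible answer:

      import numpy as np

      rng = np.random.default_rng()         # NumPy's recommended random generator
      samples = rng.normal(0, 1, 1000)      # 1000 draws from the standard Normal
      print(samples.mean(), samples.std())  # should be close to 0 and 1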

  6. Give two examples of user behavior that are easily datafied and briefly say how they can later be used in a Python‑based data‑science project.

  7. Write a SQL DDL statement to create a table Customer with columns:

    • CustomerID (integer, primary key)

    • Name (VARCHAR(50), not null)

    • Email (VARCHAR(100), unique)

  8. Write a SQL DML INSERT statement to add one row into the Customer table created in Question 7.

  9. Using SQL, write a simple SELECT query to return all columns from Customer whose Name is 'Rahul'.
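
    A sketch covering Questions 7-9, run here through Python's built-in sqlite3 module (an assumption; the same SQL works in any relational database):

      import sqlite3

      conn = sqlite3.connect(":memory:")   # throwaway in-memory database
      cur = conn.cursor()

      # Q7: DDL - create the table with key and constraint definitions
      cur.execute("""CREATE TABLE Customer (
          CustomerID INTEGER PRIMARY KEY,
          Name       VARCHAR(50) NOT NULL,
          Email      VARCHAR(100) UNIQUE
      )""")

      # Q8: DML - insert one row (the values are hypothetical)
      cur.execute("INSERT INTO Customer (CustomerID, Name, Email) "
                  "VALUES (1, 'Rahul', 'rahul@example.com')")

      # Q9: simple SELECT with a WHERE clause
      cur.execute("SELECT * FROM Customer WHERE Name = 'Rahul'")
      print(cur.fetchall())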

  10. Consider two tables:
    Department(DeptID, DeptName) and Employee(EmpID, EmpName, DeptID)

    • Identify the primary key in each table.

    • Identify the foreign key.

  11. Give one real‑world example each of:

    • One‑to‑one relationship

    • One‑to‑many relationship

    • Many‑to‑many relationship
      and mention how they might be represented in a relational database.

  12. Using Python and pandas, write code to:

    • Read a CSV file sales.csv into a DataFrame df

    • Display the first 5 rows

    • Print the column names
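
    A sample answer (assumes sales.csv is in the working directory):

      import pandas as pd

      df = pd.read_csv("sales.csv")    # read the CSV into a DataFrame
      print(df.head())                 # first 5 rows
      print(df.columns.tolist())       # column names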

  13. Using pandas, given a DataFrame df with numeric column Sales, write a one‑line command to filter all rows where Sales > 1000.
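
    One possible one-liner:

      high_sales = df[df['Sales'] > 1000]   # boolean mask keeps only rows with Sales > 1000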

  14. Using scipy.stats, show Python code to compute the CDF at x = 1.5 for the standard Normal distribution.
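
    A sample answer:

      from scipy.stats import norm

      print(norm.cdf(1.5))   # P(Z <= 1.5) for standard Normal Z, approximately 0.9332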

  15. Given the following pseudo‑code describing the data science process:

    1. Collect data

    2. Clean data

    3. Fit model

    4. Evaluate model

    5. Communicate results
      Map each step to one appropriate Python library (e.g., pandas, NumPy, matplotlib, scikit‑learn).

  16. For each of the following, state whether it is more likely structured or unstructured and justify in one line:
    a) A transaction table in MySQL
    b) A collection of customer support emails
    c) A CSV file with daily stock prices

  17. Using SQL, write a DELETE statement to remove all rows from table Orders where Status = 'Cancelled'.

  18. Consider a simple linear model Weight = 2.5 × Height + 30.

    Explain in one or two sentences what is meant by “fitting the model” in Python using a dataset of (height, weight) pairs.

  19. Using matplotlib in Python, write code to plot a histogram of a NumPy array data with 30 bins and suitable labels for axes and title.
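
    A sample answer (the data array is simulated here only to make the snippet self-contained):

      import numpy as np
      import matplotlib.pyplot as plt

      data = np.random.default_rng().normal(0, 1, 1000)   # placeholder data

      plt.hist(data, bins=30)
      plt.xlabel("Value")
      plt.ylabel("Frequency")
      plt.title("Histogram of data")
      plt.show()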

  20. Compare R and Python for Data Science.

Module 2
  1. In the context of data science, what is the main purpose of Exploratory Data Analysis (EDA) according to John Tukey?

  2. List any three basic tools of EDA mentioned in your materials.

  3. Distinguish between a chart, a graph, and a plot as per the Module 2 slides.

  4. State two reasons why we perform EDA before building machine learning models.

  5. The Iris dataset has 150 records and 4 features.
    a) What does df.shape return for this dataset?
    b) What do the two numbers represent?

  6. Write one line of Python to see the first 3 rows of a pandas DataFrame df.

  7. Which pandas method would you use to:
    a) Get the last 5 rows of a DataFrame?
    b) Get a statistical summary of all numeric columns?

  8. What does df.info() help you to check? Mention at least two things it shows.

  9. In the Iris dataset, identify:
    a) One feature column
    b) The target column

  10. Classify each of the following as numerical continuous, numerical discrete, or categorical in the Iris dataset context:
    a) Sepal length (cm)
    b) Species code (0, 1, 2)
    c) Species name (“setosa”, “versicolor”, “virginica”)

  11. What is the range of Pearson’s correlation coefficient r? What does r = 0 imply?

  12. In one sentence each, define:
    a) Missing value
    b) Duplicate row
    c) Outlier

  13. Give one real-world reason why missing values can occur in a dataset.

  14. Write a single line of Python code to count missing values per column in df.
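
    One possible answer:

      df.isnull().sum()   # count of missing values in each column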

  15. In pandas, what does df.dropna() do by default?

  16. Give one situation where you would fill missing values with the mean or median instead of dropping rows.

  17. Write one line of Python to fill missing values in a numeric column df['sepal length (cm)'] with its median.
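
    One possible answer:

      df['sepal length (cm)'] = df['sepal length (cm)'].fillna(df['sepal length (cm)'].median())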

  18. What does df.duplicated().sum() tell you about a DataFrame?

  19. Write one line of Python to remove duplicate rows from a DataFrame df and store the result in df_clean.
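
    One possible answer:

      df_clean = df.drop_duplicates()   # keeps the first occurrence of each duplicated row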

  20. Name the two statistical values used to compute the Interquartile Range (IQR).

Level 2 Questions (8 Marks)

Module 1
Chapter 1
  1. You are given a CSV file customers.csv with columns: CustomerID, Name, Age, City, AnnualIncome.
    Write Python (using pandas) to:
    a) Load the file
    b) Display the summary statistics (mean, min, max) of AnnualIncome
    c) Show only those customers whose Age is greater than the mean age.
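
    A sample answer:

      import pandas as pd

      df = pd.read_csv("customers.csv")                       # a) load the file
      print(df['AnnualIncome'].agg(['mean', 'min', 'max']))   # b) summary statistics of income
      print(df[df['Age'] > df['Age'].mean()])                 # c) customers older than the mean age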

  2. Using your knowledge of levels of measurement, justify why it is invalid to compute an average on:
    a) gender column coded as Male/Female
    but valid to compute an average on:
    b) ExamScore column (0–100).
    Then show a short pandas snippet that computes the mean exam score.

  3. You have a NumPy array x representing 500 simulated observations from an Exponential distribution with rate λ = 0.2.
    a) Write the Python code using numpy.random to generate x.
    b) Using scipy.stats.expon, estimate λ from the data.
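
    A sample answer; note that NumPy parameterizes the Exponential by the scale (1/λ), not the rate:

      import numpy as np
      from scipy.stats import expon

      rng = np.random.default_rng()
      x = rng.exponential(scale=1 / 0.2, size=500)   # a) scale = 1/λ = 5

      loc, scale = expon.fit(x, floc=0)              # b) fix loc at 0; scale estimates 1/λ
      print("estimated lambda:", 1 / scale)          # should be close to 0.2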

  4. Write a Python function using NumPy that takes a 1‑D array of numerical data and returns a dictionary containing:
    {'mean': ..., 'median': ..., 'std': ...}.
    Explain briefly how such a function supports statistical inference in data science.
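
    A sample answer (the function name is illustrative):

      import numpy as np

      def summarize(arr):
          # point estimates of location and spread, the building blocks of inference
          return {'mean': np.mean(arr), 'median': np.median(arr), 'std': np.std(arr)}

      print(summarize(np.array([2, 4, 4, 4, 5, 5, 7, 9])))   # {'mean': 5.0, 'median': 4.5, 'std': 2.0}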

  5. Consider a dataset of daily sales stored in a relational database table DailySales(Date, StoreID, SalesAmount).
    a) Write a SQL query to find the total monthly sales for each store.
    b) Suggest how you would visualize these totals in Python using matplotlib or seaborn.

  6. Using SQL:

    • Create two tables Product(ProductID, ProductName, Price) and OrderItem(OrderID, ProductID, Quantity) with suitable primary and foreign key constraints.

    • Then write a query to list each OrderID with the total order value (sum of Price × Quantity).

  7. Construct a short real‑world example (3–4 lines) where a biased sample leads to a wrong business conclusion, and suggest one way to improve the sampling strategy.

  8. A dataset df has missing values in the Salary column. Using pandas:
    a) Count the number of missing values in Salary.
    b) Create a new column Salary_filled where missing values are replaced by the column median.
    c) Explain why median might be preferred over mean here.
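
    A sketch of parts (a) and (b), with (c) noted as a comment:

      print(df['Salary'].isnull().sum())                                 # a) number of missing salaries
      df['Salary_filled'] = df['Salary'].fillna(df['Salary'].median())   # b) median imputation
      # c) the median is robust: a few extreme salaries cannot distort the fill value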

  9. Use Python (NumPy + matplotlib) to simulate and plot:

    • 1000 samples from a Normal(0, 1) distribution

    • 1000 samples from a Uniform(−3, 3) distribution
      on the same figure with different colors and a legend.
      Comment briefly on the differences in shape.
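
    A sample answer:

      import numpy as np
      import matplotlib.pyplot as plt

      rng = np.random.default_rng()
      normal_data = rng.normal(0, 1, 1000)      # bell-shaped, concentrated near 0
      uniform_data = rng.uniform(-3, 3, 1000)   # flat across [-3, 3]

      plt.hist(normal_data, bins=30, alpha=0.5, color='blue', label='Normal(0, 1)')
      plt.hist(uniform_data, bins=30, alpha=0.5, color='orange', label='Uniform(-3, 3)')
      plt.legend()
      plt.title("Normal vs Uniform samples")
      plt.show()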

  10. In the context of a simple linear regression using scikit-learn, describe:
    a) One situation that might cause underfitting
    b) One that might cause overfitting
    c) One Python technique to reduce overfitting.

  11. Write a Python function that:

    • Accepts parameters mu and sigma

    • Uses scipy.stats.norm to compute the probability that a Normal(μ, σ) random variable lies between mu − sigma and mu + sigma.
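
    A sample answer (the function name is illustrative; the result is about 0.6827 for any mu and sigma):

      from scipy.stats import norm

      def prob_within_one_sigma(mu, sigma):
          # P(mu - sigma < X < mu + sigma) for X ~ Normal(mu, sigma)
          return norm.cdf(mu + sigma, loc=mu, scale=sigma) - norm.cdf(mu - sigma, loc=mu, scale=sigma)

      print(prob_within_one_sigma(0, 1))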

  12. Using pandas, given a DataFrame df with columns Dept, Employee, and Salary, write code to compute the average salary per department and plot it as a bar chart using matplotlib.
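
    One possible answer:

      import matplotlib.pyplot as plt

      avg_salary = df.groupby('Dept')['Salary'].mean()   # average salary per department
      avg_salary.plot(kind='bar')                        # pandas' bar plot wraps matplotlib
      plt.ylabel("Average salary")
      plt.title("Average salary by department")
      plt.show()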

  13. Describe a Python‑based approach for detecting and handling outliers in a numeric feature using:
    a) The IQR method (conceptual steps)
    b) Pandas/NumPy operations you would use.

  14. Consider a binary classification problem where you predict whether a transaction is fraudulent or legitimate.
    Outline a small Python‑based workflow (bullet points) from data loading to model evaluation, using libraries mentioned in Python_for_DS.pdf (pandas, NumPy, scikit‑learn, matplotlib).

  15. Using SQL, write a single query to:

    • Select all employees from Employee(EmpID, Name, Department, Salary)

    • Who are in the IT department

    • And have salary greater than the average salary of all employees.

  16. Suppose you initially created a table Student(RegNo, Name, Branch) and later decide to add Email and increase the size of Name.
    Write appropriate ALTER TABLE statements.

  17. For the following types of data, give a real‑world example and say how you would store it for later analysis with Python:
    a) Temporal data
    b) Spatial data
    c) Temporal‑cum‑spatial (GIS) data

  18. List any four tools in a typical data‑science toolkit, and for each, give a one‑line example of how it would be used together with Python in a single project.

  19. Given a pandas DataFrame df with a datetime column Timestamp and numeric column Value, show code to:
    a) Convert Timestamp to pandas datetime
    b) Set it as the index
    c) Resample the data to daily mean values.
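
    A sample answer (assumes pandas is already imported as pd):

      df['Timestamp'] = pd.to_datetime(df['Timestamp'])   # a) parse strings into datetimes
      df = df.set_index('Timestamp')                      # b) a datetime index enables resampling
      daily_mean = df['Value'].resample('D').mean()       # c) daily mean of Value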

  20. For a retail store, give one example question for:

    • Descriptive analytics

    • Predictive analytics

    • Prescriptive analytics
      and mention one Python library you’d primarily rely on for each.

Module 2
  1. Explain the difference between EDA and data preprocessing in terms of goals and outcomes.

  2. Using the Iris dataset, write Python code to:
    a) Load the dataset into a pandas DataFrame with proper column names.
    b) Add a Species column with the string names (“setosa”, etc.).
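
    One common approach, using scikit-learn's bundled copy of the dataset (an assumption; reading a CSV would work equally well):

      import pandas as pd
      from sklearn.datasets import load_iris

      iris = load_iris()
      df = pd.DataFrame(iris.data, columns=iris.feature_names)                    # a) proper column names
      df['Species'] = pd.Categorical.from_codes(iris.target, iris.target_names)   # b) string names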

  3. You are given a DataFrame df with Iris data. Write Python code to print:
    a) Dataset shape
    b) Data type summary
    c) Statistical summary
    d) Value counts of the Species column

  4. In practical terms, how can EDA guide your choice of machine learning model when working with the Iris dataset?

  5. For each of the following, write the most appropriate pandas method and a short explanation of its use during EDA:
    a) head()
    b) describe()
    c) value_counts()

  6. Consider the EDA objective “Spot anomalies”. Describe how using df.describe() and basic plots could help with this for the Iris dataset.

  7. Explain what is meant by features (X) and target (y), and write Python code to separate them from a DataFrame df that contains the Iris dataset with a Species column.
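
    A sample answer for the code part:

      X = df.drop(columns=['Species'])   # features: the four measurement columns (inputs)
      y = df['Species']                  # target: the label the model should predict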

  8. You find that a column that should be numeric is stored as strings in pandas.
    a) Which function do you use to convert it?
    b) Write the code line for converting df['sepal length'] to numeric, forcing invalid values to become NaN.
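
    One possible answer for (b), assuming pandas is imported as pd:

      df['sepal length'] = pd.to_numeric(df['sepal length'], errors='coerce')   # invalid entries become NaN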

  9. Describe a scenario where you would prefer imputing missing values instead of dropping rows in a real data science project.

  10. Consider this code snippet from the workbook:

    most_common = df['Species'].mode()[0]

    df['Species'].fillna(most_common, inplace=True)

    Explain what each line does and why mode is used here.

  11. Explain the meaning of a positive, negative, and zero Pearson correlation between two numeric variables in an EDA context.

  12. Write a Python snippet using the IQR method (no need to import data) to find the lower_bound and upper_bound for a numeric column df['col'].
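
    A sample answer:

      Q1 = df['col'].quantile(0.25)    # first quartile
      Q3 = df['col'].quantile(0.75)    # third quartile
      IQR = Q3 - Q1
      lower_bound = Q1 - 1.5 * IQR     # conventional 1.5 * IQR fences
      upper_bound = Q3 + 1.5 * IQR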

  13. Suppose you detect outliers using the IQR method. Explain the difference between removing outliers and capping them using clip(), and write one line of code that caps df['col'] between lower_bound and upper_bound.
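
    One possible capping line:

      df['col'] = df['col'].clip(lower=lower_bound, upper=upper_bound)   # cap values at the IQR fences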

  14. Why can outliers be a serious problem for certain machine learning algorithms such as linear regression?

  15. You suspect there are duplicate rows based on ['sepal length (cm)', 'sepal width (cm)'] only. Write a line of code to count how many duplicates exist in just those columns.
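
    One possible answer:

      df.duplicated(subset=['sepal length (cm)', 'sepal width (cm)']).sum()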

  16. Write Python code to:
    a) Add two duplicate rows at the end of a DataFrame df
    b) Then remove all duplicates and reassign to df_clean
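
    A sample answer (re-appending the last two rows is one easy way to create duplicates):

      import pandas as pd

      df = pd.concat([df, df.tail(2)], ignore_index=True)      # a) add two duplicate rows at the end
      df_clean = df.drop_duplicates().reset_index(drop=True)   # b) remove all duplicates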

  17. Given the Data Science Process Model discussed in the slides, briefly describe how EDA fits into the overall data science process.

  18. Discuss the trade-off between keeping outliers (with a flag column) vs removing them, in the context of a real-estate price prediction dataset.

  19. In the Iris dataset, which type of plot(s) would you use to visually check:
    a) Distribution of sepal length (cm)
    b) Relationship between sepal length (cm) and petal length (cm)
    c) Class separation among different species?

  20. How does ensuring clean data (through handling missing values, duplicates, and outliers) improve the performance of basic ML algorithms like kNN, Linear Regression, or k-means?

Level 3 Questions (8-10 Marks)

Module 1
  1. You are given a raw CSV of online orders (orders_raw.csv) with messy columns: order_id, user_id, order_date, product_name, price, qty, city.
    Design a small end‑to‑end Python for Data Science pipeline for Module‑1 scope that includes:

    • Reading and cleaning the data with pandas

    • Creating normalized tables in an RDBMS (outline at least Orders, Customers, Products with keys and relationships)

    • Writing the cleaned data back into the database using Python

    • One example SQL query you would run later to support analysis.

  2. Consider a housing price dataset with features: Size(sqft), Bedrooms, LocationCategory, Price.
    a) Describe in detail how you would:

    • Identify whether Price vs Size appears approximately linear using Python (NumPy + matplotlib).

    • Fit a simple linear regression using scikit‑learn.
      b) Explain how the concept of “model fitting” and “probability distribution of errors” from your slides appear in this workflow.
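
    A minimal sketch of part (a), using hypothetical data:

      import numpy as np
      import matplotlib.pyplot as plt
      from sklearn.linear_model import LinearRegression

      size = np.array([800, 1000, 1200, 1500, 1800]).reshape(-1, 1)   # Size(sqft), hypothetical
      price = np.array([60, 75, 88, 110, 130])                        # Price, hypothetical units

      plt.scatter(size, price)   # visual check: does the point cloud look roughly linear?
      plt.xlabel("Size (sqft)")
      plt.ylabel("Price")
      plt.show()

      model = LinearRegression().fit(size, price)   # least-squares fit of Price on Size
      print(model.coef_, model.intercept_)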

  3. Design a relational schema for a simplified social‑media application that supports:

    • Users

    • Posts

    • Likes

    • Follows (who follows whom)
      Identify all primary keys, foreign keys, and the types of relationships (1‑1, 1‑many, many‑many).
      Then write:
      a) SQL DDL to create the User and Post tables.
      b) An SQL query to find the top 5 users by number of posts.

  4. Choose two distributions appropriate for modeling:

    • Inter‑arrival times of customer support calls

    • Daily average website load time (in seconds)
      For each distribution:
      a) Justify why it is suitable.
      b) Show how you would simulate data in Python (NumPy / scipy.stats).
      c) Briefly explain how simulation helps you in model selection during data‑science work.

  5. Design a mini case study where a company wants to reduce customer churn.

    • Define the business problem in terms of data science.

    • Identify potential data sources (SQL databases, logs, etc.) and how you’d access them with Python.

    • Outline steps of the data science process (from your slides) and map each step to specific Python or SQL operations you would perform.

  6. Using your understanding of data quality problems (inaccurate data, poor security, inconsistency, etc.):

    • Propose a concrete example of inconsistent data management in a student database.

    • Show how you would detect and partially fix this using:
      a) SQL queries (for detection)
      b) Python (pandas) for bulk corrections and validation.

  7. Assume you have a large text log file of web server requests.
    Focus on:

    • Turning this unstructured log into a structured table suitable for SQL (fields like timestamp, user_id, url, response_time)

    • Writing a Python script outline (no need for full code) that:
      • Uses regular expressions / simple parsing to extract fields
      • Stores them in a pandas DataFrame
      • Writes them into an SQL table

    • Suggest two simple analyses (with SQL or Python) that would immediately give business value.

  8. Integrate statistical inference and SQL:

    • Suppose you have a table Transactions(UserID, Amount, Date) with millions of rows.

    • Describe how you would sample data from this table (using SQL) to estimate the average monthly spending per user.

    • Then outline, in Python, how you would:
      • Compute confidence intervals for the mean using the Normal or t‑distribution from scipy.stats
      • Check if spending appears normally distributed using a simple visualization or test.

Module 2
  1. Design a step-by-step EDA plan for the Iris dataset starting from loading the data to identifying potential outliers. List the major steps and the key pandas/visualization methods you would use at each step.

  2. You are working with an extended Iris-like dataset that contains additional noisy measurements and some wrongly entered values. Explain in detail how you would:
    a) Detect and handle missing values in both numeric and categorical columns
    b) Decide when to drop vs impute
    c) Document these decisions for reproducibility.

  3. Explain the purpose of each major step of EDA and how it contributes to preparing data for a machine learning model:

    • Introducing missing values

    • Filling numeric and categorical NaNs

    • Dropping duplicates

    • Detecting and capping outliers

    • Inspecting info() and describe() again at the end.

  4. Assume you’re using the Iris dataset to train a k-Nearest Neighbors (kNN) classifier. Discuss how unclean data (missing values, duplicates, and outliers) can specifically affect kNN’s:
    a) Distance calculations
    b) Decision boundaries
    c) Prediction reliability.

  5. Suppose you introduce synthetic outliers into sepal length (cm) such as extremely large values (e.g., 100 cm). Describe how you would:
    a) Detect them using both Z-score and IQR
    b) Compare the results of the two methods
    c) Decide which method is more appropriate in this context.

  6. You are given two numeric variables from a dataset (not necessarily Iris). Correlation between them is close to 0, but scatter plots show a clear non-linear relationship.
    a) Why does Pearson correlation fail here?
    b) How would you handle this in your EDA and later in model selection?

  7. Design a mini EDA + cleaning pipeline in pseudocode or Python-outline for a generic tabular dataset (similar to Iris), that includes:

    • Loading data

    • Inspecting structure and types

    • Handling missing values

    • Dealing with duplicates

    • Detecting and treating outliers

    • Splitting into features and target for ML.

  8. Your boss asks: “Why are we spending so much time on EDA and cleaning instead of directly training models like Linear Regression and k-means?”
    Frame a detailed, data-science-oriented answer that explains:
    a) The philosophy of EDA
    b) How EDA reduces risk of incorrect conclusions
    c) How it improves later modeling steps and results.

  9. Consider you are performing EDA for a RealDirect (online real-estate firm) case study (as mentioned in the syllabus). Describe:
    a) The types of data you expect (features and target)
    b) Key EDA checks you would perform (missingness, outliers, correlations)
    c) How findings from EDA influence your choice among Linear Regression, kNN, and k-means.

  10. For a dataset with a datetime column (e.g., collection date of samples), explain how you would:
    a) Convert it properly to datetime in pandas
    b) Extract useful time-based features (year, month, day, etc.)
    c) Use these derived features in EDA and in a basic machine learning model such as Linear Regression or kNN.
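
    A minimal sketch for parts (a) and (b) (the column name date is hypothetical; assumes pandas is imported as pd):

      df['date'] = pd.to_datetime(df['date'], errors='coerce')   # a) parse; invalid entries become NaT
      df['year'] = df['date'].dt.year                            # b) derived time-based features
      df['month'] = df['date'].dt.month
      df['day'] = df['date'].dt.day
      # c) these numeric columns can be plotted during EDA or fed directly to a model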