Exercise: Multi-variable Regression 1

The datasets get progressively more complex:

The Advertising Dataset has only 3 numerical features, with 200 samples
The Auto MPG Dataset has 5 numerical features and 3 categorical features, with 398 samples

Your task is to do the following:

Load the data: pd.read_csv()
Identify the features and their types: df.info()
Identify the target and its type: target_col = ...
Explore the data: stats (df.describe()) and visuals (import seaborn as sns)
Initialize a regression model: SGDRegressor
Train the model: .fit()
Evaluate the model: .score()
Inspect the model weights:
- Hint: standardize the numerical features first (StandardScaler) so that the weights are on a comparable scale
- Example: how much does spending on TV ads contribute to Sales? (Hint: model.coef_)
- Example: how much of Sales is not attributed to any of the features, i.e. the baseline the model predicts before adding their contributions? (Hint: model.intercept_)
Make predictions on new cases of your choice: .predict()
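
The steps above can be sketched end to end on synthetic data, so the exercise itself stays open. Everything here is an assumption made for illustration: the column names feature_a, feature_b and target, the generated values, the train/test split (not part of the task list, added so that .score() is measured on unseen rows), and the SGDRegressor settings. Treat it as one possible shape for a solution, not the expected one.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT one of the exercise datasets): two features on
# very different scales and a target that depends linearly on both.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.uniform(0, 300, size=200),
    "feature_b": rng.uniform(0, 50, size=200),
})
noise = rng.normal(0, 0.5, size=200)
df["target"] = 5.0 + 0.05 * df["feature_a"] + 0.2 * df["feature_b"] + noise

# Identify the target and separate it from the features.
target_col = "target"
X = df.drop(columns=[target_col])
y = df[target_col]

# Hold out a quarter of the rows for evaluation (an addition to the task list).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Standardize the features, then fit a linear model with stochastic gradient descent.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X_train, y_train)

# R^2 on the held-out rows.
r2 = model.score(X_test, y_test)

# Weights are per standard deviation of each feature, so they can be compared.
sgd = model.named_steps["sgdregressor"]
weights = pd.Series(sgd.coef_, index=X.columns)
intercept = sgd.intercept_[0]

# Predict for two hypothetical new cases.
new_cases = pd.DataFrame({"feature_a": [100.0, 250.0], "feature_b": [10.0, 40.0]})
predictions = model.predict(new_cases)

print(f"R^2: {r2:.3f}")
print(weights)
print(f"intercept: {intercept:.2f}")
print(predictions)
```

Because the scaler sits inside the pipeline, the same scaling learned from the training rows is applied automatically in .score() and .predict(). With standardized features, the intercept is roughly the prediction for a case where every feature is at its training average.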

import pandas as pd

1. Advertising Dataset

The Advertising Dataset is a widely used teaching dataset in statistical learning and regression analysis. It is best known from the textbook “An Introduction to Statistical Learning” (ISLR) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, which uses it in its early chapters to introduce statistical learning and linear regression.

The dataset is used to illustrate the relationship between advertising budgets across different media and the resulting product sales.

Features: 3 numerical
Target: sales of the product (in thousands of units).
Size: 200 samples.
Source: Advertising Dataset

pd.read_csv("../datasets/advertising.csv").head()

	TV	Radio	Newspaper	Sales
0	230.1	37.8	69.2	22.1
1	44.5	39.3	45.1	10.4
2	17.2	45.9	69.3	12.0
3	151.5	41.3	58.5	16.5
4	180.8	10.8	58.4	17.9

# INSERT YOUR CODE

2. Auto MPG Dataset

The Auto MPG Dataset is a classic benchmark for regression analysis in machine learning. It originally appeared in the 1983 American Statistical Association (ASA) Exposition and was later donated to the UCI Machine Learning Repository by Ross Quinlan in 1993.

The data consists of technical specifications for car models from model years 1970 to 1982 and is primarily used to predict fuel efficiency (MPG).

Features: 5 numerical, 3 categorical
Target: mpg (miles per gallon)
Size: 398 samples
Source: Auto MPG Dataset

pd.read_csv("../datasets/auto-mpg.csv").head()

	mpg	cylinders	displacement	horsepower	weight	acceleration	model year	origin	car name
0	18.0	8	307.0	130	3504	12.0	70	1	chevrolet chevelle malibu
1	15.0	8	350.0	165	3693	11.5	70	1	buick skylark 320
2	18.0	8	318.0	150	3436	11.0	70	1	plymouth satellite
3	16.0	8	304.0	150	3433	12.0	70	1	amc rebel sst
4	17.0	8	302.0	140	3449	10.5	70	1	ford torino
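
Before this data can go into a regression model, the non-numeric and categorical columns need attention. The sketch below rebuilds the five rows shown above by hand and appends one hypothetical row (clearly made up) so that it runs without the CSV. It rests on several assumptions: that horsepower may contain a text marker such as "?" for missing values in some copies of the file, that origin is treated as categorical and one-hot encoded, and that car name is dropped as a near-unique label. Which columns count as categorical is a judgment call, and other choices are reasonable.

```python
import pandas as pd

# The five rows from the preview above, plus one HYPOTHETICAL sixth row whose
# horsepower is "?" to imitate a missing-value marker.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0, 17.0, 25.0],
    "cylinders": [8, 8, 8, 8, 8, 4],
    "displacement": [307.0, 350.0, 318.0, 304.0, 302.0, 100.0],
    "horsepower": ["130", "165", "150", "150", "140", "?"],
    "weight": [3504, 3693, 3436, 3433, 3449, 2000],
    "acceleration": [12.0, 11.5, 11.0, 12.0, 10.5, 19.0],
    "model year": [70, 70, 70, 70, 70, 71],
    "origin": [1, 1, 1, 1, 1, 3],
    "car name": [
        "chevrolet chevelle malibu",
        "buick skylark 320",
        "plymouth satellite",
        "amc rebel sst",
        "ford torino",
        "hypothetical car",  # made-up row, not from the dataset
    ],
})

# Convert horsepower to numbers; anything unparseable (such as "?") becomes NaN.
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")
n_missing = int(df["horsepower"].isna().sum())

# Fill the gaps with the column median (dropping those rows is another option).
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].median())

# Separate the target and drop the free-text label column.
target_col = "mpg"
X = df.drop(columns=[target_col, "car name"])
y = df[target_col]

# One-hot encode origin so the model does not read its codes as quantities.
X = pd.get_dummies(X, columns=["origin"], dtype=float)

print(n_missing)
print(X.dtypes)
```

On the full dataset, the median would ideally be computed on the training rows only, so that no information from the evaluation rows leaks into the model.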

# INSERT YOUR CODE

Find more datasets on the UCI Machine Learning Repository.