Exercise: Multi-variable Regression 1

The datasets get progressively more complex:

  1. The Advertising Dataset has only 3 numerical features, with 200 samples
  2. The Auto MPG Dataset has 5 numerical features and 3 categorical features, with 398 samples

Your task is to do the following:

  1. Load the data: pd.read_csv()
  2. Identify the features and their types: df.info()
  3. Identify the target and its type: target_col = ...
  4. Explore the data: stats (df.describe()) and visuals (import seaborn as sns)
  5. Initialize a regression model: SGDRegressor
  6. Train the model: .fit()
  7. Evaluate the model: .score()
  8. Inspect the model weights:
    • Hint: standardize the numerical features first (StandardScaler) so that the weights are comparable across features
    • Example: how much does spending on TV ads contribute to Sales? (Hint: model.coef_)
    • Example: what baseline level of Sales does the model predict independently of the ad budgets? (Hint: model.intercept_)
  9. Make predictions on new cases of your choice: .predict()
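
The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions, not a solution to either dataset: the data is synthetic, and the column names feature_a, feature_b and target are made up for the example. Chaining StandardScaler and SGDRegressor in a pipeline is one possible way to apply the scaling hint from step 8.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT one of the exercise datasets): two features on
# different scales and a target built from known weights plus noise.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.uniform(0, 300, size=200),  # hypothetical large-scale feature
    "feature_b": rng.uniform(0, 50, size=200),   # hypothetical small-scale feature
})
df["target"] = (
    5.0 + 0.05 * df["feature_a"] + 0.2 * df["feature_b"] + rng.normal(0, 1, size=200)
)

# Step 3: pick the target; everything else is a feature.
target_col = "target"
X = df.drop(columns=[target_col])
y = df[target_col]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Steps 5-6: scale the features, then fit a linear model with SGD.
model = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
model.fit(X_train, y_train)

# Step 7: .score() returns R^2 for regressors; here on held-out data.
r2 = model.score(X_test, y_test)

# Step 8: weights are per standard deviation of each feature after scaling.
sgd = model.named_steps["sgdregressor"]
weights = dict(zip(X.columns, sgd.coef_))
intercept = sgd.intercept_[0]

# Step 9: predict on new cases with the same column names.
new_cases = pd.DataFrame({"feature_a": [100.0, 250.0], "feature_b": [10.0, 40.0]})
predictions = model.predict(new_cases)

print(f"R^2: {r2:.3f}")
print("weights:", weights)
print(f"intercept: {intercept:.2f}")
print("predictions:", predictions)
```

Because the features are standardized inside the pipeline, each weight reads as the change in the target for a one-standard-deviation change in that feature, and the intercept sits close to the mean of the training target.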
import pandas as pd

1. Advertising Dataset

The Advertising Dataset is a standard teaching resource in statistical learning and regression analysis. It is best known from the textbook “An Introduction to Statistical Learning” (ISLR) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, where it is introduced in Chapter 2 and used throughout the linear regression chapter (Chapter 3).

The dataset is used to illustrate the relationship between advertising budgets across different media and the resulting product sales.

  • Features: 3 numerical
  • Target: sales of the product (in thousands of units).
  • Size: 200 samples.
  • Source: Advertising Dataset
pd.read_csv("../datasets/advertising.csv").head()
      TV  Radio  Newspaper  Sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3   12.0
3  151.5   41.3       58.5   16.5
4  180.8   10.8       58.4   17.9
# INSERT YOUR CODE

2. Auto MPG Dataset

The Auto MPG Dataset is a classic benchmark for regression analysis in machine learning. It originally appeared in the 1983 American Statistical Association (ASA) Exposition and was later donated to the UCI Machine Learning Repository by Ross Quinlan in 1993.

The data consists of technical specifications for car models from model years 1970 to 1982 (stored as 70 to 82 in the model year column) and is primarily used to predict fuel efficiency (MPG).

  • Features: 5 numerical, 3 categorical
  • Target: mpg (miles per gallon)
  • Size: 398 samples
  • Source: Auto MPG Dataset
pd.read_csv("../datasets/auto-mpg.csv").head()
    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin                   car name
0  18.0          8         307.0         130    3504          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0         165    3693          11.5          70       1          buick skylark 320
2  18.0          8         318.0         150    3436          11.0          70       1         plymouth satellite
3  16.0          8         304.0         150    3433          12.0          70       1              amc rebel sst
4  17.0          8         302.0         140    3449          10.5          70       1                ford torino
# INSERT YOUR CODE

Find more datasets on the UCI Machine Learning Repository.