Exercise: Linear Regression

Note: you can download the notebook from the sidebar, where it says “Jupyter”.

Your task is to do the following:

  1. Load the data
  2. Explore the data
  3. Identify the features and their types
  4. Identify the target and its type
  5. Initialize a regression model
  6. Train the model
  7. Evaluate the model
  8. Make predictions on new cases of your choice

import pandas as pd
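
The eight steps listed above can be sketched end to end as follows. This is a minimal sketch, not a reference solution: it assumes scikit-learn is installed, it types in the five preview rows of the Hours vs Marks dataset by hand instead of reading the CSV file, and the two "new cases" are arbitrary values chosen for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# 1. Load the data. Assumption: the five preview rows are typed in by hand;
#    in the exercise you would call pd.read_csv(...) on the downloaded file.
df = pd.DataFrame({
    "Hours_Studied": [4.76, 3.00, 2.08, 4.04, 9.49],
    "Marks": [46.27, 34.30, 33.63, 47.81, 66.26],
})

# 2. Explore the data (count, mean, std, min, quartiles, max per column).
print(df.describe())

# 3-4. Identify the feature(s) and the target; both are continuous here.
X = df[["Hours_Studied"]]  # double brackets keep X two-dimensional
y = df["Marks"]            # the target is a one-dimensional Series

# 5-6. Initialize the regression model and train it.
model = LinearRegression()
model.fit(X, y)

# 7. Evaluate. Assumption: R^2 is computed on the training rows because
#    five rows are too few to hold out a test set.
r2 = model.score(X, y)
print(f"slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}, R^2={r2:.3f}")

# 8. Predict for new cases (hypothetical study hours chosen for illustration).
new_cases = pd.DataFrame({"Hours_Studied": [1.0, 6.0]})
predictions = model.predict(new_cases)
print(predictions)
```

On the full dataset it is usually better to evaluate on rows the model has not seen, for example by splitting the data with sklearn.model_selection.train_test_split before fitting.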

1. Hours vs Marks Dataset

Download Hours vs Marks Dataset [Source].

Some important characteristics of the dataset to ask yourself about:

  • What is the number of samples?
  • What is the distribution of the features (min, max, mean, median, std, quantiles)? Hint: use DataFrame.describe()
  • What is the distribution of the target?
  • Are there missing values?
  • What is the relationship between the feature and the target: increasing, decreasing, random, or non-linear (for example, increasing up to some value of X, then decreasing)? Plot the feature against the target to answer this.

pd.read_csv("../datasets/Rounded_Student_Hours_Studied_vs_Marks_Dataset.csv").head()
Hours_Studied Marks
0 4.76 46.27
1 3.00 34.30
2 2.08 33.63
3 4.04 47.81
4 9.49 66.26
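
The questions listed for this dataset can each be answered with a single pandas call. The sketch below assumes the five preview rows shown above stand in for the full file, so the numbers it produces describe the preview only, not the whole dataset.

```python
import pandas as pd

# Assumption: the five preview rows, typed in by hand for illustration.
marks_df = pd.DataFrame({
    "Hours_Studied": [4.76, 3.00, 2.08, 4.04, 9.49],
    "Marks": [46.27, 34.30, 33.63, 47.81, 66.26],
})

# Number of samples.
n_samples = len(marks_df)

# Distribution of the feature and the target: count, mean, std, min,
# 25%/50%/75% quantiles (50% is the median), and max.
summary = marks_df.describe()

# Missing values per column.
missing = marks_df.isna().sum()

# A rough numeric hint about the relationship: the Pearson correlation
# is positive when the target tends to increase with the feature.
corr = marks_df["Hours_Studied"].corr(marks_df["Marks"])

print(n_samples)
print(summary)
print(missing)
print(corr)
```

A correlation value only measures linear association, so it does not replace the plot; marks_df.plot.scatter(x="Hours_Studied", y="Marks") draws it when matplotlib is installed.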

2. Experience vs Salary Data

Download Experience vs Salary Dataset

Is the relationship between the feature (Experience Years) and the target (Salary) linear?

pd.read_csv("../datasets/Salary Data.csv").head()
Experience Years Salary
0 1.1 39343
1 1.2 42774
2 1.3 46205
3 1.5 37731
4 2.0 43525
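
One way to probe whether the relationship is linear is to fit a straight line and look at what is left over. The sketch below is one possible approach, not the only one: it uses numpy.polyfit on the five preview rows shown above (an assumption made so the example is self-contained), which are far too few to settle the question for the full dataset.

```python
import numpy as np
import pandas as pd

# Assumption: the five preview rows, typed in by hand for illustration.
salary_df = pd.DataFrame({
    "Experience Years": [1.1, 1.2, 1.3, 1.5, 2.0],
    "Salary": [39343, 42774, 46205, 37731, 43525],
})

x = salary_df["Experience Years"].to_numpy()
y = salary_df["Salary"].to_numpy()

# Least-squares straight line: Salary ~ slope * Experience + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals: observed salary minus the salary the line predicts.
residuals = y - (slope * x + intercept)

# Pearson correlation between the feature and the target.
pearson_r = np.corrcoef(x, y)[0, 1]

print(slope, intercept)
print(residuals)
print(pearson_r)
```

If the residuals scatter around zero with no visible pattern when plotted against Experience Years, a straight line is a reasonable description; a systematic curve in the residuals points to a non-linear relationship.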

3. BMI and Life Expectancy Dataset

Download BMI and Life Expectancy Dataset

Optional: the Country column can be used to group some countries together to find out some underlying patterns that are not directly visible in the data.

pd.read_csv("../datasets/bmi_and_life_expectancy.csv").head()
Country Life expectancy BMI
0 Afghanistan 52.8 20.62058
1 Albania 76.8 26.44657
2 Algeria 75.5 24.59620
3 Andorra 84.6 27.63048
4 Angola 56.7 22.25083
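
The optional grouping idea can be sketched with a lookup table and DataFrame.groupby. Everything about the grouping here is an assumption for illustration: the region_of dictionary is a hypothetical helper that is not part of the dataset, grouping by continent is only one possible choice, and the data is limited to the five preview rows shown above.

```python
import pandas as pd

# Assumption: the five preview rows, typed in by hand for illustration.
bmi_df = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "Life expectancy": [52.8, 76.8, 75.5, 84.6, 56.7],
    "BMI": [20.62058, 26.44657, 24.59620, 27.63048, 22.25083],
})

# Hypothetical helper: maps each country to a continent. You would have to
# extend it (or pick a different grouping) to cover the full dataset.
region_of = {
    "Afghanistan": "Asia",
    "Albania": "Europe",
    "Algeria": "Africa",
    "Andorra": "Europe",
    "Angola": "Africa",
}

# Countries missing from the dictionary would get NaN and drop out of the groups.
bmi_df["Region"] = bmi_df["Country"].map(region_of)

# Average life expectancy and BMI per group.
by_region = bmi_df.groupby("Region")[["Life expectancy", "BMI"]].mean()
print(by_region)
```

Comparing the group averages, or fitting one regression per group, may reveal patterns that a single fit over all countries hides.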