Exercise: EDA on Penguins

Note: you can download the notebook from the sidebar, where it says “Jupyter”.

In this notebook exercise, we will work through simple exploratory data analysis (EDA) steps on the popular penguins dataset.

Load the dataset

The following will load the dataset automatically.

import seaborn as sns
# You don't have to download the dataset
# the following will download it for you
df = sns.load_dataset('penguins')

Dataset source: https://github.com/allisonhorst/palmerpenguins

df.shape
(333, 7)

Step 1: Understand the Features

You can find information about this dataset here: https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data

Question: in your own words:

  1. describe each feature
  2. mention its type (numeric or categorical)
  3. write its name in Arabic

Please use a Markdown cell to write your answer:

INSERT ANSWER HERE

Hint: you can attach an image to illustrate what the features are.

Step 2: Inspect the Data

  • Have a look at the columns and their values (head, sample, tail)
  • Look at the technical information (info)
# INSERT CODE HERE
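One possible sketch of this step, shown on a small hypothetical stand-in frame with the same columns so it runs without downloading anything (in the notebook you would call these directly on the seaborn-loaded df):

```python
import pandas as pd

# Stand-in frame mirroring the penguins columns (hypothetical values)
df = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap'],
    'island': ['Torgersen', 'Biscoe', 'Dream'],
    'bill_length_mm': [39.1, 46.1, 48.3],
    'bill_depth_mm': [18.7, 13.2, 18.4],
    'flipper_length_mm': [181.0, 211.0, 195.0],
    'body_mass_g': [3750.0, 4500.0, 3700.0],
    'sex': ['Male', 'Female', 'Male'],
})

print(df.head(3))    # first rows
print(df.sample(2))  # a random sample of rows
print(df.tail(3))    # last rows
df.info()            # dtypes, non-null counts, memory usage
```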

Step 3: Missing Values and Duplicates

  1. Calculate the count of missing values
  2. Calculate the percentage of missing values
  3. For each column, check and handle missing values. You may:
    1. fill with mean or median
    2. drop the rows
  4. Check and handle duplicated rows
  5. If you chose to drop missing values, how much data did you lose?
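The steps above could be sketched as follows, on a tiny hypothetical frame with deliberately missing and duplicated rows (the notebook applies the same calls to the real df):

```python
import numpy as np
import pandas as pd

# Stand-in frame with a missing number, a missing category, and a duplicate
df = pd.DataFrame({
    'bill_length_mm': [39.1, np.nan, 39.1, 46.5],
    'sex': ['Male', 'Female', 'Male', None],
})

# 1. Count of missing values per column
print(df.isna().sum())

# 2. Percentage of missing values per column
print(df.isna().mean() * 100)

# 3. Handle missing values: fill numeric with the median,
#    drop rows where a categorical value is missing
df['bill_length_mm'] = df['bill_length_mm'].fillna(df['bill_length_mm'].median())
before = len(df)
df = df.dropna(subset=['sex'])

# 4. Check and drop duplicated rows
print(df.duplicated().sum())
df = df.drop_duplicates()

# 5. How much data was lost?
print(f'rows lost: {before - len(df)} ({(before - len(df)) / before:.1%})')
```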

Step 4: Data types conversion

  • We shall convert the string types to category to save memory
  • Numeric types can be stored at lower precision: float32
mem_usage_before = df.memory_usage(deep=True)
# convert categorical types
df['species'] = df['species'].astype('category')
# ...?
# ...?
# convert numerical types
df['bill_depth_mm'] = df['bill_depth_mm'].astype('float32')
# ...?
# ...?
# ...?

Calculate memory saved after type conversion

# mem_usage_after = ...?
print('memory saved:', (mem_usage_before - mem_usage_after).sum() // 1024, 'KB')
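One way the full conversion could look, again on a hypothetical stand-in frame so it runs standalone (column names match the real dataset):

```python
import pandas as pd

# Stand-in frame with the penguins column types (hypothetical values, repeated
# to make the memory difference visible)
df = pd.DataFrame({
    'species': ['Adelie', 'Gentoo'] * 100,
    'island': ['Torgersen', 'Biscoe'] * 100,
    'sex': ['Male', 'Female'] * 100,
    'bill_length_mm': [39.1, 46.1] * 100,
    'bill_depth_mm': [18.7, 13.2] * 100,
    'flipper_length_mm': [181.0, 211.0] * 100,
    'body_mass_g': [3750.0, 4500.0] * 100,
})

mem_usage_before = df.memory_usage(deep=True)

# Convert string columns to category
for col in ['species', 'island', 'sex']:
    df[col] = df[col].astype('category')

# Store numeric columns at lower precision
for col in ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']:
    df[col] = df[col].astype('float32')

mem_usage_after = df.memory_usage(deep=True)
print('memory saved:', (mem_usage_before - mem_usage_after).sum() // 1024, 'KB')
```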

Step 5: Detect inconsistency in categorical values

The categorical columns should be checked for inconsistencies. For example, in the sex column we look for mixed lowercase and uppercase spellings, or inconsistent use of codes (e.g., “M”, “F”) alongside full labels (e.g., “Male”, “Female”).

  • hint: use .unique() to list the distinct values in a column
  • you can also use: .value_counts() to check the frequency of each value in a column
# INSERT CODE HERE
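A sketch of what the check and one possible clean-up could look like, using a hypothetical sex column with deliberately inconsistent labels:

```python
import pandas as pd

# Stand-in column with mixed codes, cases, and full labels (hypothetical)
df = pd.DataFrame({'sex': ['Male', 'male', 'F', 'Female', 'MALE', 'Female']})

print(df['sex'].unique())        # spot the different spellings
print(df['sex'].value_counts())  # frequency of each variant

# One way to normalise: lowercase everything, then map codes to full labels
mapping = {'m': 'Male', 'male': 'Male', 'f': 'Female', 'female': 'Female'}
df['sex'] = df['sex'].str.lower().map(mapping)
print(df['sex'].unique())
```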

Step 6: Univariate Analysis

  • Separate numerical from categorical columns (hint: use df.select_dtypes())
  • Look at the statistical information for each:
    • df_num.describe().T
    • df_cat.describe().T
# INSERT CODE HERE
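One possible split, shown on a small hypothetical frame (the notebook does the same on the real df):

```python
import pandas as pd

# Stand-in frame with a mix of categorical and numeric columns (hypothetical)
df = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Adelie'],
    'sex': ['Male', 'Female', 'Female'],
    'bill_length_mm': [39.1, 46.1, 38.2],
    'body_mass_g': [3750.0, 4500.0, 3600.0],
})

# Split columns by dtype
df_num = df.select_dtypes(include='number')
df_cat = df.select_dtypes(exclude='number')

print(df_num.describe().T)  # count, mean, std, quartiles
print(df_cat.describe().T)  # count, unique, top, freq
```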

Use charts to plot the value_counts() of the categorical variables:

  1. plot species using bar plot
  2. plot island using pie chart
  3. plot sex using horizontal bar plot
# INSERT CODE HERE
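A sketch of the three charts, drawn on a hypothetical stand-in frame; the Agg backend is selected so this also runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in categorical columns (hypothetical counts)
df = pd.DataFrame({
    'species': ['Adelie'] * 3 + ['Gentoo'] * 2 + ['Chinstrap'],
    'island': ['Torgersen'] * 2 + ['Biscoe'] * 3 + ['Dream'],
    'sex': ['Male'] * 4 + ['Female'] * 2,
})

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
df['species'].value_counts().plot.bar(ax=axes[0], title='species')   # 1. bar
df['island'].value_counts().plot.pie(ax=axes[1], title='island')     # 2. pie
df['sex'].value_counts().plot.barh(ax=axes[2], title='sex')          # 3. horizontal bar
plt.tight_layout()
```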

Plot numerical variables:

  1. Boxplot: bill_length_mm
  2. Histogram: bill_depth_mm
  3. Boxplot: flipper_length_mm
  4. Histogram: body_mass_g
# INSERT CODE HERE
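The four numeric plots could be arranged in one figure like this, again on a small hypothetical frame:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in numeric columns (hypothetical values)
df = pd.DataFrame({
    'bill_length_mm': [39.1, 46.1, 38.2, 48.3],
    'bill_depth_mm': [18.7, 13.2, 18.1, 18.4],
    'flipper_length_mm': [181.0, 211.0, 180.0, 195.0],
    'body_mass_g': [3750.0, 4500.0, 3600.0, 3700.0],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
df['bill_length_mm'].plot.box(ax=axes[0, 0], title='bill_length_mm')       # 1. boxplot
df['bill_depth_mm'].plot.hist(ax=axes[0, 1], title='bill_depth_mm')        # 2. histogram
df['flipper_length_mm'].plot.box(ax=axes[1, 0], title='flipper_length_mm') # 3. boxplot
df['body_mass_g'].plot.hist(ax=axes[1, 1], title='body_mass_g')            # 4. histogram
plt.tight_layout()
```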

Step 7: Bivariate Analysis

Correlation between numerical features

Let’s find out if there is any correlation between numerical features.

  • Hint: you can use df.corr() to find the correlation matrix.
  • Hint: you can use sns.heatmap() to plot the correlation matrix
  • Hint: you can use sns.pairplot() to visualize relationships.
# INSERT CODE HERE
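One way the correlation step could look, on a hypothetical numeric stand-in frame (the numeric_only=True argument assumes pandas ≥ 1.5 and avoids errors if categorical columns are still present):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in numeric columns (hypothetical values)
df = pd.DataFrame({
    'bill_length_mm': [39.1, 46.1, 38.2, 48.3],
    'bill_depth_mm': [18.7, 13.2, 18.1, 18.4],
    'body_mass_g': [3750.0, 4500.0, 3600.0, 3700.0],
})

# Correlation matrix of the numeric columns
corr = df.corr(numeric_only=True)
ax = sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)

# sns.pairplot(df) would additionally draw scatter plots for every pair
# of numeric columns (slower on the full dataset)
```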

Question: Write down your observations based on the correlation heatmap.

Observations:

INSERT ANSWER HERE

Looking at these distributions, how hard do you think it would be to classify the penguins using only "culmen depth" and "culmen length" (the bill_depth_mm and bill_length_mm columns in this dataset)?

INSERT ANSWER HERE

Feature Engineering

  • We might try adding the feature bill_size, the product of bill_length_mm and bill_depth_mm, to see whether it has any significance in the model.
  • We might also try bill_ratio, the ratio of bill_length_mm to bill_depth_mm, for the same reason.
# INSERT CODE HERE
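The two engineered features could be created like this (hypothetical stand-in values; the notebook applies the same two lines to the real df):

```python
import pandas as pd

# Stand-in bill measurements (hypothetical values)
df = pd.DataFrame({
    'bill_length_mm': [39.1, 46.1, 38.2],
    'bill_depth_mm': [18.7, 13.2, 18.1],
})

# Hypothetical engineered features
df['bill_size'] = df['bill_length_mm'] * df['bill_depth_mm']   # product
df['bill_ratio'] = df['bill_length_mm'] / df['bill_depth_mm']  # ratio
print(df[['bill_size', 'bill_ratio']])
```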

Let’s look at the correlation to see whether the newly created features are better.

  1. Compute the correlation matrix
  2. Select the 'body_mass_g' column, sort it, and plot it using horizontal bar plot
# INSERT CODE HERE
# This plots the correlation values for a specific column
# which is usually what we are interested in

# corr['body_mass_g'].sort_values().plot.barh()
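Putting the two steps together, a sketch of the final plot on a hypothetical stand-in frame (numeric_only=True assumes pandas ≥ 1.5):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in numeric columns plus the engineered features (hypothetical values)
df = pd.DataFrame({
    'bill_length_mm': [39.1, 46.1, 38.2, 48.3],
    'bill_depth_mm': [18.7, 13.2, 18.1, 18.4],
    'body_mass_g': [3750.0, 4500.0, 3600.0, 3700.0],
})
df['bill_size'] = df['bill_length_mm'] * df['bill_depth_mm']
df['bill_ratio'] = df['bill_length_mm'] / df['bill_depth_mm']

# 1. Correlation matrix
corr = df.corr(numeric_only=True)

# 2. Correlations with body_mass_g, sorted, as a horizontal bar plot
ax = corr['body_mass_g'].sort_values().plot.barh(title='correlation with body_mass_g')
```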