# Exercise: EDA on Penguins
Note: you can download the notebook from the sidebar, where it says “Jupyter”.
In this notebook exercise, we will conduct simple EDA steps on the popular penguins dataset.
## Load the dataset
The following code loads the dataset automatically.

```python
import seaborn as sns

# You don't have to download the dataset manually;
# the following will download it for you
df = sns.load_dataset('penguins')
```

Dataset source: https://github.com/allisonhorst/palmerpenguins

```python
df.shape  # (333, 7)
```
## Step 1: Understand the Features
You can find information about this dataset here: https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data
Question: in your own words,
- describe each feature
- mention its type (numeric or categorical)
- write its name in Arabic
Please use a Markdown cell to write your answer:
INSERT ANSWER HERE
Hint: you can attach an image to illustrate what the features are.

## Step 2: Inspect the Data

- Have a look at the columns and their values (`head`, `sample`, `tail`)
- Look at the technical information (`info`)
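A possible sketch of these inspection calls, using a tiny hand-made stand-in for the penguins data (the column names match the real dataset; in the notebook, reuse the `df` loaded above instead):

```python
import pandas as pd

# Tiny stand-in for the penguins DataFrame (same column names)
df = pd.DataFrame({'species': ['Adelie', 'Gentoo', 'Adelie'],
                   'island': ['Torgersen', 'Biscoe', 'Dream'],
                   'bill_length_mm': [39.1, 47.5, None],
                   'body_mass_g': [3750.0, 5200.0, 3800.0]})

print(df.head(2))                    # first 2 rows
print(df.sample(2, random_state=0))  # 2 random rows (seeded for reproducibility)
print(df.tail(2))                    # last 2 rows
df.info()                            # dtypes, non-null counts, memory usage
```

`sample` is often more informative than `head` on sorted data, since it shows rows from across the whole table.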
# INSERT CODE HEREStep 3
- Calculate count of missing values
- Calculate the percentage of missing values
- For each column, check and handle missing values. You may:
  - fill with the mean or median
  - drop the rows
- Check and handle duplicated rows
- If you chose to drop missing values, how much data did you lose?
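One way these checks might look, sketched on a tiny hand-made stand-in for the data (in the notebook, apply them to the `df` loaded above):

```python
import pandas as pd

# Tiny stand-in: one missing value, one duplicated row
df = pd.DataFrame({'species': ['Adelie', 'Gentoo', 'Adelie', 'Adelie'],
                   'bill_length_mm': [39.1, 47.5, None, 39.1],
                   'body_mass_g': [3750.0, 5200.0, 3800.0, 3750.0]})

missing_count = df.isna().sum()       # count of missing values per column
missing_pct = df.isna().mean() * 100  # as a percentage of the rows
print(missing_count)
print(missing_pct.round(1))

df_clean = df.dropna()                # one option: drop rows with missing values
# alternative: fill a numeric column with its median, e.g.
# df['bill_length_mm'] = df['bill_length_mm'].fillna(df['bill_length_mm'].median())

dup_count = df_clean.duplicated().sum()  # fully duplicated rows
df_clean = df_clean.drop_duplicates()

rows_lost = len(df) - len(df_clean)
print(f'rows lost: {rows_lost} of {len(df)}')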
## Step 4: Data Type Conversion

- We shall convert the string types to `category` to preserve memory
- Numeric types can be stored in lower precision: `float32`
```python
mem_usage_before = df.memory_usage(deep=True)

# convert categorical types
df['species'] = df['species'].astype('category')
# ...?
# ...?

# convert numerical types
df['bill_depth_mm'] = df['bill_depth_mm'].astype('float32')
# ...?
# ...?
# ...?
```

Calculate the memory saved after the type conversion:
```python
# mem_usage_after = ...?
print('memory saved:', (mem_usage_before - mem_usage_after).sum() // 1024, 'KB')
```

## Step 5: Detect Inconsistency in Categorical Values
The categorical columns should be checked for inconsistencies. For example, we look for lowercase/uppercase variants, or inconsistent use of codes (e.g., “M”, “F”) alongside non-codes (e.g., “Male”, “Female”) in the sex column.
- Hint: use `.unique()` to list the distinct values in a column
- You can also use `.value_counts()` to check the frequency of each value in a column
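A sketch of such a check, using a small hand-made example that deliberately contains inconsistent codes (the real penguins data may not have this problem; the normalisation mapping below is just an illustration):

```python
import pandas as pd

# Hand-made example with inconsistent sex codes
df = pd.DataFrame({'sex': ['Male', 'FEMALE', 'Female', 'M'],
                   'island': ['Torgersen', 'Biscoe', 'Biscoe', 'Dream']})

for col in df.select_dtypes(include='object').columns:
    print(col, '->', df[col].unique())   # spot case/coding inconsistencies
print(df['sex'].value_counts())

# One way to normalise: map everything to a canonical form
df['sex'] = df['sex'].str.capitalize().replace({'M': 'Male', 'F': 'Female'})
print(df['sex'].unique())
```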
```python
# INSERT CODE HERE
```

## Step 6: Univariate Analysis
- Separate the numerical from the categorical columns (hint: use `df.select_dtypes()`)
- Look at the statistical information for each: `df_num.describe().T` and `df_cat.describe().T`
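These two steps might be sketched as follows, on a tiny stand-in DataFrame (in the notebook, use the real `df`):

```python
import pandas as pd

df = pd.DataFrame({'species': ['Adelie', 'Gentoo', 'Adelie'],
                   'bill_length_mm': [39.1, 47.5, 36.7],
                   'body_mass_g': [3750.0, 5200.0, 3800.0]})

df_num = df.select_dtypes(include='number')   # numerical columns
df_cat = df.select_dtypes(exclude='number')   # everything else (categorical here)

print(df_num.describe().T)  # count, mean, std, min, quartiles, max
print(df_cat.describe().T)  # count, unique, top, freq
```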
```python
# INSERT CODE HERE
```

Use charts to plot the `value_counts()` of the categorical variables:
- plot `species` using a bar plot
- plot `island` using a pie chart
- plot `sex` using a horizontal bar plot
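The three plots above can be sketched with pandas' plotting API, here on a tiny stand-in DataFrame (the `Agg` backend line is only needed outside Jupyter):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; unnecessary inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'species': ['Adelie', 'Gentoo', 'Adelie'],
                   'island': ['Torgersen', 'Biscoe', 'Torgersen'],
                   'sex': ['Male', 'Female', 'Female']})

df['species'].value_counts().plot.bar(title='species')   # vertical bars
plt.show()
df['island'].value_counts().plot.pie(title='island')     # pie chart
plt.show()
df['sex'].value_counts().plot.barh(title='sex')          # horizontal bars
plt.show()
```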
```python
# INSERT CODE HERE
```

Plot the numerical variables:
- Boxplot: `bill_length_mm`
- Histogram: `bill_depth_mm`
- Boxplot: `flipper_length_mm`
- Histogram: `body_mass_g`
```python
# INSERT CODE HERE
```

## Step 7: Bivariate Analysis
### Correlation between numerical features

Let’s find out if there is any correlation between the numerical features.

- Hint: you can use `df.corr()` to find the correlation matrix.
- Hint: you can use `sns.heatmap()` to plot the correlation matrix.
- Hint: you can use `sns.pairplot()` to visualize pairwise relationships.
```python
# INSERT CODE HERE
```

Question: Write down your observations based on the correlation heatmap.
Observations:
INSERT ANSWER HERE
Looking at these distributions, how hard do you think it would be to classify the penguins using only "culmen depth" and "culmen length"?
INSERT ANSWER HERE
## Feature Engineering

- We might try adding the feature `bill_size`, the product of `bill_length_mm` and `bill_depth_mm`, to see if it has any significance in the model.
- We might also try `bill_ratio`, the ratio of `bill_length_mm` to `bill_depth_mm`, to see if it has any significance in the model.
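These two derived columns can be sketched as simple element-wise operations (shown on a two-row stand-in; in the notebook, add them to the real `df`):

```python
import pandas as pd

df = pd.DataFrame({'bill_length_mm': [39.1, 47.5],
                   'bill_depth_mm': [18.7, 15.0]})

df['bill_size'] = df['bill_length_mm'] * df['bill_depth_mm']   # product of the two
df['bill_ratio'] = df['bill_length_mm'] / df['bill_depth_mm']  # ratio of the two
print(df)
```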
```python
# INSERT CODE HERE
```

Let’s look at the correlation again to see whether the newly created features are better.
- Compute the correlation matrix
- Select the `body_mass_g` column, sort it, and plot it using a horizontal bar plot
```python
# INSERT CODE HERE

# This plots the correlation values for a specific column,
# which is usually what we are interested in:
# corr['body_mass_g'].sort_values().plot.barh()
```