Encoding Categorical Features

Categorical data must be encoded numerically before most machine learning algorithms can work with it.

Ordinal categories, such as sizes (small, medium, large) or ranks (first, second, third), are encoded as integers that preserve the order. OrdinalEncoder does this, numbering the categories from zero: (0, 1, 2).

Both ordinal and unordered (nominal) categories can also be encoded with zeros and ones that mark absence and presence, using one column per unique value of the feature. OneHotEncoder does this.

import numpy as np
import pandas as pd
from sklearn import preprocessing

Ordinal Encoder (OrdinalEncoder)

from sklearn.preprocessing import OrdinalEncoder

# Specify the order of categories
categories = [
    ['low', 'medium', 'high'], # categories of first feature
    ['1st', '2nd', '3rd'],     # categories of second feature
]

encoder = OrdinalEncoder(
    categories=categories,
    handle_unknown='use_encoded_value', # means that if we encounter an unknown category, we will encode it as a specific value
    unknown_value=-1
)
# We want a pandas DataFrame as output rather than a NumPy array (default)
encoder = encoder.set_output(transform='pandas')

df = pd.DataFrame({
    'risk': ['low', 'medium', 'low', 'low', 'high'],
    'class': ['1st', '3rd', '2nd', '1st', '3rd'],
})
df

	risk	class
0	low	1st
1	medium	3rd
2	low	2nd
3	low	1st
4	high	3rd

encoder.fit_transform(df)

	risk	class
0	0.0	0.0
1	1.0	2.0
2	0.0	1.0
3	0.0	0.0
4	2.0	2.0

One-Hot Encoder (OneHotEncoder)

from sklearn.preprocessing import OneHotEncoder

# Create a OneHotEncoder instance
encoder = OneHotEncoder(
    handle_unknown='infrequent_if_exist',
    sparse_output=False,    # <-- output is a dense array
)
encoder.set_output(transform='pandas')

OneHotEncoder(handle_unknown='infrequent_if_exist', sparse_output=False)

df_train = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Green']
})
df_train

	color
0	Red
1	Blue
2	Green
3	Green

# Fit and transform the data
encoder.fit_transform(df_train)

	color_Blue	color_Green	color_Red
0	0.0	0.0	1.0
1	1.0	0.0	0.0
2	0.0	1.0	0.0
3	0.0	1.0	0.0

df_test = pd.DataFrame({
    'color': ['Blue', 'Green', 'dragonfruit']
})
df_test

	color
0	Blue
1	Green
2	dragonfruit

# Transform the test data; 'dragonfruit' was not seen during fit
result = encoder.transform(df_test)
result

	color_Blue	color_Green	color_Red
0	1.0	0.0	0.0
1	0.0	1.0	0.0
2	0.0	0.0	0.0

The Memory Problem with One-Hot Encoding

Note that a separate column is created for every unique value of a feature, which can exhaust memory. We therefore limit the number of categories: first by keeping only the values that occur at least min_frequency times, then by capping the number of categories with max_categories.

For example, we compute the number of columns and the added memory that result from encoding the categorical variables of the Ames Housing dataset:

Metric	Before encoding	After one-hot encoding	Increase
Number of columns	~46	~318	~6.9x
Size	~0.15 MB	~7.10 MB	~47.3x

For details on how this table was built, see: Highlighting the memory problem in one-hot encoding.

The Solution

Use min_frequency to reduce the number of columns by grouping infrequent categories into a single column.
Use max_categories to cap the total number of columns (including the column for infrequent categories).

OneHotEncoder can group infrequent categories into a single output per feature, as summarized in the table below:

Parameter	Type	Constraint	Description
`min_frequency`	`int`	\(\ge 1\)	Categories that occur fewer times than this integer are considered infrequent.
_	`float`	\((0.0, 1.0)\)	Categories that occur in less than this fraction of the total number of samples are considered infrequent.
`max_categories`	`int`	\(> 1\)	Caps the number of output features for each input feature, including the “infrequent” category.
_	`None`	default	No limit on the number of output features.

X = np.array([
    ['cat'] * 20 +
    ['rabbit'] * 10 +
    ['snake'] * 6 +
    ['dragon'] * 3 +
    ['dinosaur'] * 2
], dtype=str).T
X.shape

(41, 1)

Reminder: multiplying a list by an integer repeats its elements, and adding two lists concatenates them into a new list.

enc = preprocessing.OneHotEncoder(
    min_frequency=6,
    max_categories=3,
    handle_unknown='infrequent_if_exist',
    sparse_output=False,
)
enc.set_output(transform='pandas')
enc.fit(X)

OneHotEncoder(handle_unknown='infrequent_if_exist', max_categories=3,
              min_frequency=6, sparse_output=False)

Note how both 'dragon' and 'dinosaur' are mapped to the encoding [0., 0., 1.], which is shared by all infrequent categories.
Because we capped max_categories at 3, 'snake' does not get its own column even though it satisfies min_frequency: the third output is reserved for the infrequent group.

enc.transform(np.array([
    ['rabbit'],
    ['rabbit'],
    ['cat'],
    ['snake'],
    ['dragon'],
    ['dinosaur'],
]))

	x0_cat	x0_rabbit	x0_infrequent_sklearn
0	0.0	1.0	0.0
1	0.0	1.0	0.0
2	1.0	0.0	0.0
3	0.0	0.0	1.0
4	0.0	0.0	1.0
5	0.0	0.0	1.0

print("Categories:", enc.categories_)
print("Infrequent categories:", enc.infrequent_categories_)

Categories: [array(['cat', 'dinosaur', 'dragon', 'rabbit', 'snake'], dtype='<U8')]
Infrequent categories: [array(['dinosaur', 'dragon', 'snake'], dtype='<U8')]

Geocoding, which produces latitude and longitude information, is often a better choice for spatial categories.
For more, see: 7.3.4. Encoding categorical features (scikit-learn User Guide).
