Question

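I have a machine learning classification problem in which 80% of the variables are categorical. Must I use one-hot encoding if I want to use a classifier? Can I pass the data to a classifier without encoding it?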

I am trying to do the following for feature selection:

I read the train file:

num_rows_to_read = 10000
train_small = pd.read_csv("../../dataset/train.csv",   nrows=num_rows_to_read)

I change the type of the categorical features to 'category':

non_categorial_features = ['orig_destination_distance',
                          'srch_adults_cnt',
                          'srch_children_cnt',
                          'srch_rm_cnt',
                          'cnt']

for categorical_feature in list(train_small.columns):
    if categorical_feature not in non_categorial_features:
        train_small[categorical_feature] = train_small[categorical_feature].astype('category')

I use one-hot encoding:

train_small_with_dummies = pd.get_dummies(train_small, sparse=True)

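The problem is that the third step (the one-hot encoding) often gets stuck, even though I am using a powerful machine.

So without the one-hot encoding I cannot do any feature selection to determine the importance of the features.

What do you recommend?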

Answer 1

Approach 1: You can use get_dummies on a pandas DataFrame.

Example 1:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]: 
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
3  1.0  0.0  0.0

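Example 2:

The following transforms a given column into one-hot columns. Use the prefix argument to keep multiple sets of dummy columns apart.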

import pandas as pd

df = pd.DataFrame({
          'A':['a','b','a'],
          'B':['b','a','c']
        })
df
Out[]: 
   A  B
0  a  b
1  b  a
2  a  c

# Get one hot encoding of columns B
one_hot = pd.get_dummies(df['B'])
# Drop column B as it is now encoded
df = df.drop('B',axis = 1)
# Join the encoded df
df = df.join(one_hot)
df  
Out[]: 
       A  a  b  c
    0  a  0  1  0
    1  b  1  0  0
    2  a  0  0  1
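As a hedged aside on the prefix remark above: get_dummies can also encode several columns of a DataFrame in one call through its columns and prefix arguments. A minimal sketch, reusing the small frame from Example 2 (dtype=int is my own choice here, so the output is 0/1 rather than booleans on newer pandas versions):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['b', 'a', 'c'],
})

# Encode both columns at once; the prefix keeps the level 'a' of column A
# distinct from the level 'a' of column B.
encoded = pd.get_dummies(df, columns=['A', 'B'], prefix=['A', 'B'], dtype=int)
print(encoded)
```

The original columns A and B are replaced by A_a, A_b, B_a, B_b and B_c, so no separate drop and join step is needed.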

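Approach 2: Use scikit-learn

Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data to a binary one-hot encoding.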

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])   
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
   handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9], dtype=int32)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Here is the link to this example: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
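The session above comes from an older scikit-learn release. As far as I know, n_values_ and feature_indices_ were removed in later versions, where the fitted encoder exposes categories_ instead. A minimal sketch under that assumption, with handle_unknown='ignore' added as my own choice to show how unseen values are treated:

```python
from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' encodes values not seen during fit as all zeros
# for that feature instead of raising an error.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# One array of learned categories per input feature.
categories = [c.tolist() for c in enc.categories_]
print(categories)

# Same sample as in the session above.
known = enc.transform([[0, 1, 1]]).toarray()
print(known)

# The value 9 was never seen for the third feature.
unseen = enc.transform([[0, 1, 9]]).toarray()
print(unseen)
```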

Answer 2

You can use numpy.eye together with the array element selection mechanism to do this:

import numpy as np
nb_classes = 6
data = [[2, 3, 4, 0]]

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]

The return value of indices_to_one_hot(data, nb_classes) is now

array([[ 0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.]])

The .reshape(-1) is there to make sure you have the right label format (you might also have [[2], [3], [4], [0]]).
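To make the point about .reshape(-1) concrete, here is a small check (my own illustration, not part of the original answer) that a row of labels and a column of labels give the same result:

```python
import numpy as np

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]

row_labels = [[2, 3, 4, 0]]            # shape (1, 4)
column_labels = [[2], [3], [4], [0]]   # shape (4, 1)

from_row = indices_to_one_hot(row_labels, 6)
from_column = indices_to_one_hot(column_labels, 6)

# Both inputs are flattened first, so both results have shape (4, 6).
print(from_row.shape, from_column.shape)
```

Without the flattening step, indexing np.eye with a 2-D array would add an extra dimension to the result.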

Answer 3

Firstly, the easiest way to one-hot encode: use sklearn.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Secondly, I don't think using pandas for one-hot encoding is that simple (unconfirmed, though):

Creating dummy variables in pandas for python

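Lastly, do you really need to one-hot encode? One-hot encoding greatly increases the number of features (one new column per level), which drastically increases the run time of any classifier or anything else you are going to run, especially when each categorical feature has many levels. Instead, you can do dummy coding.

Dummy coding usually works well, with much lower run time and complexity. A wise professor once told me, 'Less is more'.

Here is the code for my custom encoding function, if you want it.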

from sklearn.preprocessing import LabelEncoder

# Auto-encodes any dataframe column of type category or object.
def dummyEncode(df):
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            # Replace each level with an integer code, in place.
            df[feature] = le.fit_transform(df[feature])
        except Exception:
            print('Error encoding ' + feature)
    return df

EDIT: A comparison, to make this clearer:

One-hot encoding: convert n levels to n-1 columns.

Index  Animal         Index  cat  mouse
  1     dog             1     0     0
  2     cat       -->   2     1     0
  3    mouse            3     0     1

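You can see how this will explode your memory if your categorical feature has many different types (or levels). Keep in mind, this is just ONE column.

Dummy coding: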

Index  Animal         Index  Animal
  1     dog             1      0   
  2     cat       -->   2      1 
  3    mouse            3      2

Convert to a numerical representation instead. This saves a great deal of feature space, at the cost of some accuracy.
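For reference, both tables can be reproduced in pandas. This is only a sketch: the category order ['dog', 'cat', 'mouse'] is set by hand so that dog is the dropped level and receives code 0, matching the tables above. Note that many libraries call the first scheme (n-1 columns) dummy coding and the second one label or ordinal encoding, so the names may differ from the ones used in this answer.

```python
import pandas as pd

animals = pd.Series(
    pd.Categorical(['dog', 'cat', 'mouse'], categories=['dog', 'cat', 'mouse']),
    name='Animal',
)

# First table: n levels -> n-1 indicator columns (the first level is dropped).
indicator_columns = pd.get_dummies(animals, drop_first=True, dtype=int)

# Second table: each level replaced by a single integer code.
integer_codes = animals.cat.codes

print(indicator_columns)
print(integer_codes.tolist())
```

The integer codes imply an ordering between levels (dog < cat < mouse) that the data does not really have, which is one way the accuracy cost mentioned above can arise for models that treat the column as numeric.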

Answer 4

One-hot encoding with pandas is very easy:

def one_hot(df, cols):
    """
    @param df pandas DataFrame
    @param cols a list of columns to encode 
    @return a DataFrame with one-hot encoding
    """
    for each in cols:
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df

EDIT:

Another way to one-hot encode, using sklearn's LabelBinarizer:

from sklearn.preprocessing import LabelBinarizer 
label_binarizer = LabelBinarizer()
label_binarizer.fit(all_your_labels_list) # need to be global or remembered to use it later

def one_hot_encode(x):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
    """
    return label_binarizer.transform(x)

Answer 5

You can use the numpy.eye function.

import numpy as np

def one_hot_encode(x, n_classes):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : n_classes: Number of distinct classes
    : return: Numpy array of one-hot encoded labels
    """
    return np.eye(n_classes)[x]

def main():
    # Named `labels` rather than `list` so the built-in is not shadowed.
    labels = [0,1,2,3,4,3,2,1,0]
    n_classes = 5
    one_hot_list = one_hot_encode(labels, n_classes)
    print(one_hot_list)

if __name__ == "__main__":
    main()

Result

D:\Desktop>python test.py
[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.]]

Answer 6

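It's easy to do basic one-hot encoding with pandas. If you're looking for more options, you can use scikit-learn.

For basic one-hot encoding with pandas, you simply pass your data frame into the get_dummies function.

For example, if I have a dataframe called imdb_movies:

...and I want to one-hot encode the Rated column, I simply do this: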

pd.get_dummies(imdb_movies.Rated)

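This returns a new dataframe with a column for every "level" of rating that exists, along with either a 1 or a 0 indicating whether a given observation has that rating.

Usually, we want this to be part of the original dataframe. In that case, we simply attach the new dummy-coded frame to the original frame using "column-binding".

We can column-bind with the pandas concat function: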

rated_dummies = pd.get_dummies(imdb_movies.Rated)
pd.concat([imdb_movies, rated_dummies], axis=1)

We can now run an analysis on the full dataframe.
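Since the imdb_movies frame itself is not shown here, the following sketch uses a tiny stand-in with made-up titles and ratings (purely hypothetical data) to show what the column-binding step produces. dtype=int is my own addition to get 0/1 values on newer pandas versions:

```python
import pandas as pd

# Hypothetical stand-in for the imdb_movies dataframe.
imdb_movies = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Movie C'],
    'Rated': ['PG', 'R', 'PG'],
})

# One indicator column per rating level.
rated_dummies = pd.get_dummies(imdb_movies.Rated, dtype=int)

# Column-bind the indicators onto the original frame.
combined = pd.concat([imdb_movies, rated_dummies], axis=1)
print(combined)
```

The original Rated column is still present next to the new PG and R columns; drop it afterwards if the model should only see the encoded version.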

Simple utility function

I would recommend writing yourself a utility function to do this quickly:

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    return(res)

Usage:

encode_and_bind(imdb_movies, 'Rated')

Also, as per @pmalbu's comment, if you would like the function to remove the original feature_to_encode, use this version:

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return(res)

Answer 7

Here is a solution using DictVectorizer and the pandas DataFrame.to_dict('records') method.

>>> import pandas as pd
>>> X = pd.DataFrame({'income': [100000,110000,90000,30000,14000,50000],
                      'country':['US', 'CAN', 'US', 'CAN', 'MEX', 'US'],
                      'race':['White', 'Black', 'Latino', 'White', 'White', 'Black']
                     })

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer()
>>> qualitative_features = ['country','race']
>>> X_qual = v.fit_transform(X[qualitative_features].to_dict('records'))
>>> v.vocabulary_
{'country=CAN': 0,
 'country=MEX': 1,
 'country=US': 2,
 'race=Black': 3,
 'race=Latino': 4,
 'race=White': 5}

>>> X_qual.toarray()
array([[ 0.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  1.,  0.,  0.]])

Answer 8

pandas has a built-in function, get_dummies, to get the one-hot encoding of a particular column.

One line of code for one-hot encoding:

df=pd.concat([df,pd.get_dummies(df['column name'],prefix='column name')],axis=1).drop(['column name'],axis=1)

Answer 9

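You can pass the data to the CatBoost classifier without encoding it. CatBoost handles categorical variables itself by performing one-hot encoding and expanding-mean target encoding.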

Answer 10

I know I'm late to this party, but the simplest way to one-hot encode a dataframe in an automated way is to use this function:

def hot_encode(df):
    obj_df = df.select_dtypes(include=['object'])
    return pd.get_dummies(df, columns=obj_df.columns).values

Answer 11

You can do the following as well. Note that with this approach you don't have to use pd.concat.

import pandas as pd
# initialise data of lists.
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
       'Group':[1,2,1,2]} 

# Create DataFrame 
df = pd.DataFrame(data) 

for _c in df.select_dtypes(include=['object']).columns:
    print(_c)
    df[_c]  = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed

You can also change explicit columns to categorical. For example, here I am changing Color and Group:

import pandas as pd
# initialise data of lists.
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
       'Group':[1,2,1,2]} 

# Create DataFrame 
df = pd.DataFrame(data) 
columns_to_change = list(df.select_dtypes(include=['object']).columns)
columns_to_change.append('Group')
for _c in columns_to_change:
    print(_c)
    df[_c]  = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed

Answer 12

I used this in my acoustic model; it may help with your model too.

def one_hot_encoding(x, n_out):
    x = x.astype(int)  
    shape = x.shape
    x = x.flatten()
    N = len(x)
    x_categ = np.zeros((N,n_out))
    x_categ[np.arange(N), x] = 1
    return x_categ.reshape((shape)+(n_out,))
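A short usage sketch for the function above, with the NumPy import it relies on. The label array and n_out=4 are made-up values for illustration:

```python
import numpy as np

def one_hot_encoding(x, n_out):
    x = x.astype(int)
    shape = x.shape
    x = x.flatten()
    N = len(x)
    x_categ = np.zeros((N, n_out))
    # Set exactly one entry per row: row i gets a 1 in column x[i].
    x_categ[np.arange(N), x] = 1
    # Restore the input shape and append the one-hot axis.
    return x_categ.reshape(shape + (n_out,))

# Hypothetical 2 x 3 array of class indices in the range 0..3.
labels = np.array([[0, 1, 2],
                   [3, 0, 1]])
encoded = one_hot_encoding(labels, n_out=4)
print(encoded.shape)
```

The input shape is preserved and a trailing axis of length n_out is added, so taking argmax over the last axis recovers the original labels.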

Answer 13

This works for me:

pandas.factorize( ['B', 'C', 'D', 'B'] )[0]

Output:

[0, 1, 2, 0]
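Note that factorize returns integer codes rather than one-hot vectors. If one-hot rows are needed, the codes can be used to index an identity matrix. A minimal sketch combining this with the numpy.eye approach from other answers (wrapping the list in a Series is my own choice, to avoid relying on how factorize treats plain lists):

```python
import numpy as np
import pandas as pd

values = pd.Series(['B', 'C', 'D', 'B'])

# codes holds one integer per element, uniques holds the distinct levels
# in order of first appearance.
codes, uniques = pd.factorize(values)

# Row i of the identity matrix is the one-hot vector for code i.
one_hot = np.eye(len(uniques), dtype=int)[codes]
print(one_hot)
```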

Answer 14

To add to the other answers, let me show how I did it with a Python 2.0 function using NumPy:

def one_hot(y_):
    # Function to encode output labels from number indexes 
    # e.g.: [[5], [0], [3]] --> [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]

    y_ = y_.reshape(len(y_))
    n_values = np.max(y_) + 1
    return np.eye(n_values)[np.array(y_, dtype=np.int32)]  # Returns FLOATS

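The line n_values = np.max(y_) + 1 could be hard-coded to the right number of neurons if, for example, you use mini-batches.

Demo project/tutorial where this function has been used: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition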
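To illustrate the mini-batch remark: if n_values is derived from each batch, two batches can come out with different widths. A small sketch with a hypothetical helper, one_hot_fixed (not part of the original answer), that takes the number of classes as an argument instead:

```python
import numpy as np

def one_hot_fixed(y_, n_values):
    # Hypothetical variant: the width is passed in rather than
    # computed from the largest label in the batch.
    y_ = np.asarray(y_, dtype=np.int32).reshape(-1)
    return np.eye(n_values)[y_]

batch_a = np.array([[0], [1]])   # largest label is 1
batch_b = np.array([[5], [3]])   # largest label is 5

# Width each batch would get if it were derived from the batch itself.
derived_a = int(np.max(batch_a)) + 1
derived_b = int(np.max(batch_b)) + 1
print(derived_a, derived_b)

# With a fixed number of classes both batches have the same width.
encoded_a = one_hot_fixed(batch_a, n_values=6)
encoded_b = one_hot_fixed(batch_b, n_values=6)
print(encoded_a.shape, encoded_b.shape)
```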

Answer 15

It can and should be as easy as:

class OneHotEncoder:
    def __init__(self,optionKeys):
        length=len(optionKeys)
        self.__dict__={optionKeys[j]:[0 if i!=j else 1 for i in range(length)] for j in range(length)}

Usage:

ohe=OneHotEncoder(["A","B","C","D"])
print(ohe.A)
print(ohe.D)

Answer 16

Expanding on @Martin Thoma's answer:

def one_hot_encode(y):
    """Convert an iterable of indices to one-hot encoded labels."""
    y = y.flatten() # Sometimes not flattened vector is passed e.g (118,1) in these cases
    # the function ends up creating a tensor e.g. (118, 2, 1). flatten removes this issue
    nb_classes = len(np.unique(y)) # get the number of unique classes
    standardised_labels = dict(zip(np.unique(y), np.arange(nb_classes))) # get the class labels as a dictionary
    # which then is standardised. E.g imagine class labels are (4,7,9) if a vector of y containing 4,7 and 9 is
    # directly passed then np.eye(nb_classes)[4] or 7,9 throws an out of index error.
    # standardised labels fixes this issue by returning a dictionary;
    # standardised_labels = {4:0, 7:1, 9:2}. The values of the dictionary are mapped to keys in y array.
    # standardised_labels also removes the error that is raised if the labels are floats. E.g. 1.0; element
    # cannot be called by an integer index e.g y[1.0] - throws an index error.
    targets = np.vectorize(standardised_labels.get)(y) # map the dictionary values to array.
    return np.eye(nb_classes)[targets]
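The same label standardisation can probably be written more compactly with np.unique(..., return_inverse=True), which returns the sorted distinct labels together with each element's position among them. This is a substitute technique, not the code from the answer above, and the function name one_hot_from_labels is made up for the sketch:

```python
import numpy as np

def one_hot_from_labels(y):
    """Return (classes, one_hot) for arbitrary integer or float labels."""
    flat = np.asarray(y).ravel()
    # inverse[i] is the index of flat[i] within the sorted unique classes.
    classes, inverse = np.unique(flat, return_inverse=True)
    return classes, np.eye(len(classes), dtype=int)[inverse]

# Non-contiguous labels, passed as a column vector.
classes, one_hot = one_hot_from_labels(np.array([[4], [9], [7], [4]]))
print(classes)
print(one_hot)

# Float labels work as well, since they are never used as indices directly.
float_classes, float_one_hot = one_hot_from_labels([1.0, 2.0, 1.0])
```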

Answer 17

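Short answer

Here is a function that does one-hot encoding without using numpy, pandas, or other packages. It takes a list of integers, booleans, or strings (and perhaps other types too).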

import typing


def one_hot_encode(items: list) -> typing.List[list]:
    results = []
    # find the unique items (we want to unique items b/c duplicate items will have the same encoding)
    unique_items = list(set(items))
    # sort the unique items
    sorted_items = sorted(unique_items)
    # find how long the list of each item should be
    max_index = len(unique_items)

    for item in items:
        # create a list of zeros the appropriate length
        one_hot_encoded_result = [0 for i in range(0, max_index)]
        # find the index of the item
        one_hot_index = sorted_items.index(item)
        # change the zero at the index from the previous line to a one
        one_hot_encoded_result[one_hot_index] = 1
        # add the result
        results.append(one_hot_encoded_result)

    return results

Examples:

one_hot_encode([2, 1, 1, 2, 5, 3])

# [[0, 1, 0, 0],
#  [1, 0, 0, 0],
#  [1, 0, 0, 0],
#  [0, 1, 0, 0],
#  [0, 0, 0, 1],
#  [0, 0, 1, 0]]

one_hot_encode([True, False, True])

# [[0, 1], [1, 0], [0, 1]]

one_hot_encode(['a', 'b', 'c', 'a', 'e'])

# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]

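Long answer

I know there are already a lot of answers to this question, but I noticed two things. First, most of the answers use packages like numpy and/or pandas. That is a good thing: if you are writing production code, you should probably use robust, fast algorithms such as those provided by the numpy/pandas packages. For the sake of education, however, I think someone should provide an answer with a transparent algorithm and not just an implementation of someone else's algorithm. Second, I noticed that many of the answers do not provide a robust implementation of one-hot encoding, because they fail one of the requirements below. Here are some requirements (as I see them) for a useful, accurate, and robust one-hot encoding function:

A one-hot encoding function must:

handle lists of various types (e.g. integers, strings, floats, etc.) as input
handle an input list that contains duplicates
return a list of lists corresponding to the inputs (in the same order)
return a list of lists in which each list is as short as possible

I tested many of the answers to this question, and most of them fail one of the requirements above.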
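The four requirements can be checked directly. The sketch below is a compact variant of the function from the short answer; the one change is my own: it looks up each item's position in a dictionary instead of calling list.index for every item, which should matter only for long inputs.

```python
from typing import List


def one_hot_encode(items: list) -> List[List[int]]:
    # Sorted distinct values define the column order.
    sorted_items = sorted(set(items))
    # Map each distinct value to its column index.
    index_of = {item: i for i, item in enumerate(sorted_items)}
    width = len(sorted_items)

    results = []
    for item in items:
        row = [0] * width
        row[index_of[item]] = 1
        results.append(row)
    return results


ints = one_hot_encode([2, 1, 1, 2, 5, 3])
bools = one_hot_encode([True, False, True])
strings = one_hot_encode(['a', 'b', 'c', 'a', 'e'])

# Same order as the input, one row per item, width equal to the number
# of distinct values.
assert len(ints) == 6 and all(len(row) == 4 for row in ints)
assert bools == [[0, 1], [1, 0], [0, 1]]
print(ints)
```

As with the original function, all items in one list need to be mutually comparable (for sorting) and hashable.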

Answer 18

Try this:

!pip install category_encoders
import category_encoders as ce

categorical_columns = [...the list of names of the columns you want to one-hot-encode ...]
encoder = ce.OneHotEncoder(cols=categorical_columns, use_cat_names=True)
df_train_encoded = encoder.fit_transform(df_train_small)

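df_train_encoded.head()

The resulting dataframe df_train_encoded is the same as the original, but the categorical features are now replaced with their one-hot-encoded versions.

More information is available in the category_encoders documentation.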

Answer 19

I tried this approach:

import numpy as np

# converting to one_hot
def one_hot_encoder(value, datal):
    datal[value] = 1
    return datal


def _one_hot_values(labels_data):
    encoded = [0] * len(labels_data)
    for j, i in enumerate(labels_data):
        max_value = [0] * (np.max(labels_data) + 1)
        encoded[j] = one_hot_encoder(i, max_value)
    return np.array(encoded)

How can I one-hot encode in Python?
