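Sklearn label encoding of multiple columns in a pandas DataFrame

Time: 2017-06-10 14:39:16

Tags: python encoding scikit-learn

I am trying to encode multiple columns containing categorical data ("Yes" and "No") in a large pandas DataFrame. The full DataFrame has more than 400 columns, so I am looking for a way to encode all the required columns without doing it one at a time. I use scikit-learn's LabelEncoder to encode the categorical data.

The first part of the DataFrame does not need to be encoded, but I am looking for a way to encode all the required columns containing categorical data directly, without splitting and concatenating the DataFrame.

To demonstrate the problem, I first tried to solve it on a small part of the DataFrame. However, I get stuck at the last step, where the data is fitted and transformed, and get ValueError: bad input shape (4, 3). The code I ran: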

# Create a simple dataframe resembling the large dataframe
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})


# Import required module
from sklearn.preprocessing import LabelEncoder

# Create an object of the label encoder class
labelencoder = LabelEncoder()

# Apply labelencoder object on columns
labelencoder.fit_transform(data.ix[:, 1:])   # First column does not need to be encoded

Full error report:

labelencoder.fit_transform(data.ix[:, 1:])
Traceback (most recent call last):

  File "<ipython-input-47-b4986a719976>", line 1, in <module>
    labelencoder.fit_transform(data.ix[:, 1:])

  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 129, in fit_transform
    y = column_or_1d(y, warn=True)

  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))

ValueError: bad input shape (4, 3)

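Does anyone know how to do this?

7 answers:

Answer 0 (score: 8)

LabelEncoder works on one 1-D column at a time, which is why passing a 2-D slice fails. As the code below shows, you can encode multiple columns by applying LabelEncoder to the DataFrame column by column with apply. Note, however, that the single encoder is refitted for every column, so afterwards it only holds the class information of the last column, not of all columns.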

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})
print(df)
#    A    B    C    D
# 0  1  Yes  Yes   No
# 1  2   No   No  Yes
# 2  3  Yes   No   No
# 3  4  Yes  Yes  Yes

# LabelEncoder
le = LabelEncoder()

# apply "le.fit_transform"
df_encoded = df.apply(le.fit_transform)
print(df_encoded)
#    A  B  C  D
# 0  0  1  1  0
# 1  1  0  0  1
# 2  2  1  0  0
# 3  3  1  1  1

# Note: we cannot obtain the classes information for all columns.
print(le.classes_)
# ['No' 'Yes']
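
If the classes of every column are needed later (for example to decode the values again), one option is to keep a separate fitted encoder per column. The following is a minimal sketch and not part of the original answer; the `encoders` dictionary and the explicit column list are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})

# Hypothetical helper structure: one fitted LabelEncoder per encoded column
encoders = {}
df_encoded = df.copy()
for col in ['B', 'C', 'D']:          # column 'A' is numeric and left as-is
    encoders[col] = LabelEncoder()
    df_encoded[col] = encoders[col].fit_transform(df[col])

# The classes of each column remain available
print(list(encoders['B'].classes_))

# The original labels can be recovered from the encoded values
restored = encoders['B'].inverse_transform(df_encoded['B'])
print(list(restored))
```

The trade-off is a little extra bookkeeping in exchange for being able to call inverse_transform on any column.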

Answer 1 (score: 3)

First, find all features of type object:

objList = df.select_dtypes(include = "object").columns
print (objList)

Now, to convert the features in objList to a numeric type, you can use a for loop like the one below:

#Label Encoding for object to numeric conversion
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for feat in objList:
    df[feat] = le.fit_transform(df[feat].astype(str))

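df.info()

Note that the for loop explicitly casts each column to str. Without the cast, LabelEncoder can raise an error, for example when a column mixes strings with other values such as NaN.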
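
To illustrate why a uniform string column matters, here is a minimal sketch that is not from the original answer. The toy frame and the "Missing" placeholder label are assumptions; the idea is to replace missing values with an explicit label before casting, so that the encoder only ever sees strings.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (assumed for illustration): column 'B' contains a missing value
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ["Yes", np.nan, "No"]})

le = LabelEncoder()

# Replace missing values with an explicit placeholder, then cast to str
cleaned = df['B'].fillna("Missing").astype(str)
df['B'] = le.fit_transform(cleaned)

print(df['B'].tolist())
print(list(le.classes_))
```

With this approach the missing entries become their own class instead of an implicit string produced by the cast.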

Answer 2 (score: 1)

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer


class preprocessing(BaseEstimator, TransformerMixin):
    """Finds the categorical columns in a dataframe and one-hot encodes
    them. You can replace LabelBinarizer with LabelEncoder if you only
    require label encoding. transform() returns the numerical data and
    the encoded categorical data as one dataframe."""

    def fit(self, df, y=None):
        # df is the pandas dataframe; collect its object-dtype columns
        self.catcolumns_ = [name for name, dtype
                            in df.dtypes.astype(str).items()
                            if dtype == 'object']
        # Fit one encoder per categorical column
        self.cat_encoders_ = [(name, LabelBinarizer().fit(df[name]))
                              for name in self.catcolumns_]
        return self

    def transform(self, df, y=None):
        # Encode each categorical column, keeping the rows aligned with df
        encoded = [pd.DataFrame(encs.transform(df[name]), index=df.index)
                   for name, encs in self.cat_encoders_]
        df_num = df.drop(self.catcolumns_, axis=1)
        # Numerical columns first, then the encoded categorical columns
        return pd.concat([df_num] + encoded, axis=1, ignore_index=True)

# Use the .values of the returned dataframe in a scikit-learn pipeline

Answer 3 (score: 1)

Scikit-learn now has a helper for this: OrdinalEncoder

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})

oe = OrdinalEncoder()

t_data = oe.fit_transform(data)
print(t_data)
# [[0. 1. 1. 0.]
#  [1. 0. 0. 1.]
#  [2. 1. 0. 0.]
#  [3. 1. 1. 1.]]

It can be used directly. Note that, applied to the whole DataFrame, it also re-encodes the numeric column A (1 to 4 becomes 0 to 3).
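
OrdinalEncoder can also be restricted to the categorical columns so that numeric columns such as A pass through unchanged. A minimal sketch, assuming the list of columns to encode is known in advance; the names `cat_cols` and `encoded` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})

cat_cols = ['B', 'C', 'D']   # assumed list of columns to encode

oe = OrdinalEncoder()
encoded = data.copy()
# fit_transform returns a 2-D float array; cast to int for readability
encoded[cat_cols] = oe.fit_transform(data[cat_cols]).astype(int)

print(encoded)
# One array of categories per encoded column, in the order of cat_cols
print([list(c) for c in oe.categories_])
```

Unlike a shared LabelEncoder, the fitted OrdinalEncoder keeps the categories of every column in categories_.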

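Answer 4 (score: 1)

If you know the names of the columns and do not want to encode all of them, you can do the following (this also gets rid of the for loop):

from sklearn.preprocessing import LabelEncoder

categ = ['Pclass','Cabin_Group','Ticket','Embarked']

# Encode Categorical Columns
le = LabelEncoder()
df[categ] = df[categ].apply(le.fit_transform)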

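Answer 5 (score: 0)

You can also loop over the columns you want to encode. This approach may not be the most efficient, but it works well.

from sklearn import preprocessing

LE = preprocessing.LabelEncoder()
# test_data is a second DataFrame with the same columns as df
for col in df.columns:
    LE.fit(df[col])                                # learn the labels of this column
    df[col] = LE.transform(df[col])
    test_data[col] = LE.transform(test_data[col])  # reuse the same mapping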
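
When the same mapping has to be applied to a second DataFrame, it helps to know that LabelEncoder rejects labels it did not see during fitting. A minimal sketch, assuming toy `df` and `test_data` frames and an `encoders` dictionary that are not part of the original answer:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (assumed for illustration)
df = pd.DataFrame({'B': ["Yes", "No", "Yes"],
                   'C': ["No", "No", "Yes"]})
test_data = pd.DataFrame({'B': ["No", "Yes"],
                          'C': ["Yes", "Maybe"]})

# Fit one encoder per column on the training frame
encoders = {}
for col in df.columns:
    encoders[col] = LabelEncoder().fit(df[col])
    df[col] = encoders[col].transform(df[col])

# Column 'B' of test_data only contains labels seen during fitting
test_b = encoders['B'].transform(test_data['B'])
print(list(test_b))

# Column 'C' of test_data contains "Maybe", which was never fitted
try:
    encoders['C'].transform(test_data['C'])
    unseen_error = None
except ValueError as exc:
    unseen_error = exc
print(unseen_error is not None)
```

How to treat unseen labels (dropping rows, mapping them to a placeholder, or refitting) depends on the data and is left open here.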

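Answer 6 (score: 0)

This is the simplest version I could write:

Step 1: Get all the categorical columns:

categorical_columns = train.select_dtypes(['object']).columns

This stores the names of all the categorical columns.

Step 2: Write a for loop to do the conversion, because fit_transform only accepts one column at a time. Here is the workaround:

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
for col in categorical_columns:
    train[col] = label_encoder.fit_transform(train[col])

Step 3: Upvote, haha :)

Hope you find this useful.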