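I am trying to encode several columns of categorical data ("Yes" and "No") in a large pandas DataFrame. The full DataFrame has more than 400 columns, so I am looking for a way to encode all the required columns without encoding them one by one. I am using scikit-learn's LabelEncoder to encode the categorical data.
The first part of the DataFrame does not need to be encoded, but I am looking for a way to directly encode all the required columns that contain categorical data, without splitting and concatenating the DataFrame.
To demonstrate my problem, I first tried to solve it on a small part of the DataFrame. However, I am stuck at the last step, fitting and transforming the data, where I get ValueError: bad input shape (4, 3). The code I ran: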
# Create a simple dataframe resembling large dataframe
data = pd.DataFrame({'A': [1, 2, 3, 4],
'B': ["Yes", "No", "Yes", "Yes"],
'C': ["Yes", "No", "No", "Yes"],
'D': ["No", "Yes", "No", "Yes"]})
# Import required module
from sklearn.preprocessing import LabelEncoder
# Create an object of the label encoder class
labelencoder = LabelEncoder()
# Apply labelencoder object on columns
labelencoder.fit_transform(data.ix[:, 1:]) # First column does not need to be encoded
The full error report:
labelencoder.fit_transform(data.ix[:, 1:])
Traceback (most recent call last):
File "<ipython-input-47-b4986a719976>", line 1, in <module>
labelencoder.fit_transform(data.ix[:, 1:])
File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 129, in fit_transform
y = column_or_1d(y, warn=True)
File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (4, 3)
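Does anyone know how to do this?
Answer 0 (score: 8)
As in the code below, you can encode multiple columns by applying LabelEncoder to the DataFrame with apply. Note, however, that the class information is not kept for all columns: the single encoder is refit on each column, so afterwards it only holds the classes of the last one.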
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'A': [1, 2, 3, 4],
'B': ["Yes", "No", "Yes", "Yes"],
'C': ["Yes", "No", "No", "Yes"],
'D': ["No", "Yes", "No", "Yes"]})
print(df)
# A B C D
# 0 1 Yes Yes No
# 1 2 No No Yes
# 2 3 Yes No No
# 3 4 Yes Yes Yes
# LabelEncoder
le = LabelEncoder()
# apply "le.fit_transform"
df_encoded = df.apply(le.fit_transform)
print(df_encoded)
# A B C D
# 0 0 1 1 0
# 1 1 0 0 1
# 2 2 1 0 0
# 3 3 1 1 1
# Note: we cannot obtain the classes information for all columns.
print(le.classes_)
# ['No' 'Yes']
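If the class information is needed for every column, one option is to keep a separate LabelEncoder per column. The following is a minimal sketch under that assumption; the names `cat_cols` and `encoders` are illustrative and not part of the answer above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})

cat_cols = ['B', 'C', 'D']  # columns to encode (assumed to be known here)
encoders = {}               # hypothetical helper: one fitted encoder per column

df_encoded = df.copy()
for col in cat_cols:
    encoders[col] = LabelEncoder()
    df_encoded[col] = encoders[col].fit_transform(df[col])

# The classes of every column are still available after the loop
for col in cat_cols:
    print(col, encoders[col].classes_)

# A stored encoder can map the codes back to the original labels
restored_b = encoders['B'].inverse_transform(df_encoded['B'])
print(restored_b)
```

Column A is left untouched here because it is not listed in `cat_cols`.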
Answer 1 (score: 3)
First, find all the features of type object:
objList = df.select_dtypes(include="object").columns
print(objList)
Now, to convert the features in objList to a numeric type, you can use a for loop like this:
#Label Encoding for object to numeric conversion
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
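for feat in objList:
    df[feat] = le.fit_transform(df[feat].astype(str))
print(df.info())
Note that we explicitly cast to string with astype(str) inside the for loop; without it, the encoder can raise an error on columns that mix strings with other types.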
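To illustrate why the cast can matter, here is a small sketch with a made-up column that mixes strings and integers. The column contents are an assumption for demonstration only, and whether the uncast call raises (and with which message) may depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column that mixes strings and integers
df = pd.DataFrame({'B': ["Yes", 1, "No", 1]})

le = LabelEncoder()
try:
    # Mixed types cannot be sorted, so this is expected to fail
    le.fit_transform(df['B'])
except TypeError as err:
    print("Without astype(str):", err)

# After casting, every value is a string and can be encoded
codes = le.fit_transform(df['B'].astype(str))
print(codes)
print(le.classes_)
```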
Answer 2 (score: 1)
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer

# df is the pandas dataframe
class preprocessing(BaseEstimator, TransformerMixin):
    """Finds the categorical columns in a dataframe and one-hot encodes
    them. You can replace LabelBinarizer with LabelEncoder if you only
    require label encoding. transform returns the numerical data followed
    by the encoded categorical data as a dataframe."""

    def __init__(self, df):
        self.datatypes = df.dtypes.astype(str)
        self.catcolumns = []
        self.cat_encoders = []

    def fit(self, df, y=None):
        # Collect the names of the object (categorical) columns
        self.catcolumns = [ix for ix, val in zip(self.datatypes.index.values,
                                                 self.datatypes.values)
                           if val == 'object']
        # Fit one LabelBinarizer per categorical column
        self.cat_encoders = []
        for name in self.catcolumns:
            encs = LabelBinarizer()
            encs.fit(df[name])
            self.cat_encoders.append((name, encs))
        return self

    def transform(self, df, y=None):
        # Use a local list so that transform can be called more than once
        encoded = []
        for name, encs in self.cat_encoders:
            df_c = encs.transform(df[name])
            encoded.append(pd.DataFrame(df_c, index=df.index))
        encoded_df = pd.concat(encoded, axis=1, ignore_index=True)
        df_num = df.drop(self.catcolumns, axis=1)
        y = pd.concat([df_num, encoded_df], axis=1, ignore_index=True)
        return y
        # use return y.values to use in sci-kit learn pipeline
Answer 3 (score: 1)
Scikit-learn now has a helper for this: OrdinalEncoder
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
data = pd.DataFrame({'A': [1, 2, 3, 4],
'B': ["Yes", "No", "Yes", "Yes"],
'C': ["Yes", "No", "No", "Yes"],
'D': ["No", "Yes", "No", "Yes"]})
oe = OrdinalEncoder()
t_data = oe.fit_transform(data)
print(t_data)
# [[0. 1. 1. 0.]
# [1. 0. 0. 1.]
# [2. 1. 0. 0.]
# [3. 1. 1. 1.]]
You can use it directly.
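Since the question asks to leave the first column alone, OrdinalEncoder can also be applied to a subset of columns and the result written back in place. This is a minimal sketch of that idea; the names `cat_cols`, `encoded` and `restored` are illustrative, and the column dtype after assignment may vary between pandas versions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})

cat_cols = list(data.columns[1:])  # every column except the first
oe = OrdinalEncoder()

encoded = data.copy()
# Fit on the categorical columns only and write the codes back
encoded[cat_cols] = oe.fit_transform(data[cat_cols])
print(encoded)

# One array of categories per encoded column, in column order
print(oe.categories_)

# The codes can be mapped back to the original labels
restored = oe.inverse_transform(encoded[cat_cols].to_numpy(dtype=float))
print(restored)
```

Unlike a single reused LabelEncoder, the fitted OrdinalEncoder keeps the categories of every column in `categories_`.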
Answer 4 (score: 1)
If you know the column names and do not want to use all columns, you can do the following (this also gets rid of the for loop):
categ = ['Pclass','Cabin_Group','Ticket','Embarked']
# Encode Categorical Columns
le = LabelEncoder()
df[categ] = df[categ].apply(le.fit_transform)
Answer 5 (score: 0)
You can also loop over the columns you want to encode. This approach may not be the most efficient, but it works well.
from sklearn import preprocessing
LE = preprocessing.LabelEncoder()
for col in df.columns:
    # Fit on the training column, then reuse the same mapping for the test data
    LE.fit(df[col])
    df[col] = LE.transform(df[col])
    test_data[col] = LE.transform(test_data[col])
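One caveat with this pattern: LabelEncoder.transform raises a ValueError when the test data contains a label that was not seen during fit. The sketch below shows this with made-up data; the "Maybe" value and the `failed_cols` list are assumptions added for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and test frames; "Maybe" never occurs in df
df = pd.DataFrame({'B': ["Yes", "No", "Yes"],
                   'C': ["No", "Yes", "No"]})
test_data = pd.DataFrame({'B': ["No", "Yes"],
                          'C': ["Yes", "Maybe"]})

LE = LabelEncoder()
failed_cols = []  # columns whose test values could not be encoded
for col in df.columns:
    LE.fit(df[col])
    df[col] = LE.transform(df[col])
    try:
        test_data[col] = LE.transform(test_data[col])
    except ValueError as err:
        # Unseen labels in the test data end up here
        failed_cols.append(col)
        print(col, "->", err)

print(failed_cols)
```

How to handle such columns (dropping rows, mapping to a placeholder, or fitting on the union of labels) depends on the use case.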
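Answer 6 (score: 0)
This is the simplest version I could write:
Step 1: Get all the categorical columns:
categorical_columns = train.select_dtypes(['object']).columns
This stores the names of all the categorical columns.
Step 2: Write a for loop to do the conversion, because fit_transform only takes one column at a time. Here is the trick: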
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for col in categorical_columns:
    train[col] = label_encoder.fit_transform(train[col])
Step 3: Upvote, haha :)
Hope you find this useful.