I have a dataset that needs label encoding. I am using sklearn's LabelEncoder.
Here is reproducible code for the problem:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
data11 = pd.DataFrame({
    'Transaction_Type': ['Mortgage', 'Credit reporting', 'Consumer Loan', 'Mortgage'],
    'Complaint_reason': ['Incorrect Info', 'False Statement', 'Using a Debit Card', 'Payoff process'],
    'Company_response': ['Response1', 'Response2', 'Response3', 'Response1'],
    'Consumer_disputes': ['Yes', 'No', 'No', 'Yes'],
    'Complaint_Status': ['Processing', 'Closed', 'Awaiting Response', 'Closed']
})
le = LabelEncoder()
data11['Transaction_Type'] = le.fit_transform(data11['Transaction_Type'])
data11['Complaint_reason'] = le.transform(data11['Complaint_reason'])
data11['Company_response'] = le.fit_transform(data11['Company_response'])
data11['Consumer_disputes'] = le.transform(data11['Consumer_disputes'])
data11['Complaint_Status'] = le.transform(data11['Complaint_Status'])
The problem is that when I try to encode the columns, 'Transaction_Type' and 'Company_response' are encoded successfully, but the 'Complaint_reason', 'Consumer_disputes' and 'Complaint_Status' columns raise errors.
The desired output should look something like:
({'Transaction_Type': ['1', '2', '3', '1'],
'Complaint_reason': ['1', '2', '3', '4'],
'Company_response': ['1', '2', '3', '1'],
'Consumer_disputes': ['1', '2', '2', '1'],
'Complaint_Status': ['1','2', '3', '2']
})
For 'Complaint_reason':
File "C:/Users/Ashu/untitled0.py", line 26, in <module>
data11['Complaint_reason'] = le.transform(data11['Complaint_reason'])
ValueError: y contains new labels: ['APR or interest rate' 'Account opening, closing, or management'
'Account terms and changes' ...
"Was approved for a loan, but didn't receive the money"
'Written notification about debt' 'Wrong amount charged or received']
And similarly for 'Consumer_disputes':
File "<ipython-input-117-9625bd78b740>", line 1, in <module>
data11['Consumer_disputes'] = le.transform(data11['Consumer_disputes'].astype(str))
ValueError: y contains new labels: ['No' 'Yes']
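These are all categorical variables whose values are sentences drawn from a fixed set of inputs. Here is an image of a slice of the data:
[Image: Categorical Data Label Encoding]
There are a couple of questions about this on SO, but none of them has a successful answer.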
Answer 0 (score: 0)
Since the columns all contain different values, I think you need to initialize a new le for each column:
for col in data11.columns:
    le = LabelEncoder()
    data11[col] = le.fit_transform(data11[col])
   Transaction_Type  Complaint_reason  Company_response  Consumer_disputes  Complaint_Status
0                 2                 1                 0                  1                 2
1                 1                 0                 1                  0                 1
2                 0                 3                 2                  0                 0
3                 2                 2                 0                  1                 1
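To illustrate why a fresh encoder per column helps, here is a minimal sketch. The two-column frame and the `encoders` dict are illustrative names chosen for this example, not part of the original post. It shows that an encoder fitted on one column rejects labels from another, and that keeping one fitted encoder per column makes it possible to map the codes back to the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small illustrative frame (a subset of the columns in the question).
df = pd.DataFrame({
    'Transaction_Type': ['Mortgage', 'Credit reporting', 'Consumer Loan', 'Mortgage'],
    'Consumer_disputes': ['Yes', 'No', 'No', 'Yes'],
})

# An encoder fitted on one column only knows that column's labels,
# so transform() on a different column raises ValueError.
le = LabelEncoder()
le.fit(df['Transaction_Type'])
try:
    le.transform(df['Consumer_disputes'])
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# Keep one fitted encoder per column (hypothetical `encoders` dict).
encoders = {}
encoded = df.copy()
for col in df.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(df[col])

# Each stored encoder can later reverse its own column.
restored = encoders['Consumer_disputes'].inverse_transform(encoded['Consumer_disputes'])
```

Reusing a single `le` variable inside the loop, as above in the answer, also works for encoding; storing the encoders is only needed if you want `inverse_transform` or want to apply the same mapping to new data.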
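Answer 1 (score: 0)
You are calling transform() where you need fit_transform(), which is why you get the error: the encoder was fitted on a different column and has never seen these labels.
sklearn.preprocessing.LabelEncoder -> Encode labels with value between 0 and n_classes-1 (from the official documentation)
Still, if you want to encode the classes between 1 and n_classes, simply add 1.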
data11['Transaction_Type'] = le.fit_transform(data11['Transaction_Type'])
data11['Transaction_Type']
Output:
0 2
1 1
2 0
3 2
Name: Transaction_Type, dtype: int64
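Note that LabelEncoder() assigns codes in alphabetical order: Consumer Loan, which comes first alphabetically, gets label 0, and Mortgage, which comes last, gets label 2.
Now, you have two ways to encode it. Either accept the default output of LabelEncoder, like this: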
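The ordering can be checked directly through the fitted encoder's `classes_` attribute. The following is a small sketch; the `mapping` dict is a name introduced here for illustration. Note that for strings the sort is by code point, so uppercase letters sort before lowercase ones.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['Mortgage', 'Credit reporting', 'Consumer Loan', 'Mortgage']

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the unique labels in sorted order;
# the position of a label in classes_ is its integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())
```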
data11['Transaction_Type'] = le.fit_transform(data11['Transaction_Type'])
data11['Complaint_reason'] = le.fit_transform(data11['Complaint_reason'])
data11['Company_response'] = le.fit_transform(data11['Company_response'])
data11['Consumer_disputes'] = le.fit_transform(data11['Consumer_disputes'])
data11['Complaint_Status'] = le.fit_transform(data11['Complaint_Status'])
Output:
   Transaction_Type  Complaint_reason  Company_response  Consumer_disputes  Complaint_Status
0                 2                 1                 0                  1                 2
1                 1                 0                 1                  0                 1
2                 0                 3                 2                  0                 0
3                 2                 2                 0                  1                 1
Or add 1 to each result so that the codes run from 1 to n_classes:
# Do not sort the column first: fit_transform returns a plain array that is
# assigned by position, so sorting would misalign the codes with their rows.
data11['Transaction_Type'] = le.fit_transform(data11['Transaction_Type']) + 1
data11['Complaint_reason'] = le.fit_transform(data11['Complaint_reason']) + 1
data11['Company_response'] = le.fit_transform(data11['Company_response']) + 1
data11['Consumer_disputes'] = le.fit_transform(data11['Consumer_disputes']) + 1
data11['Complaint_Status'] = le.fit_transform(data11['Complaint_Status']) + 1
Output:
   Transaction_Type  Complaint_reason  Company_response  Consumer_disputes  Complaint_Status
0                 3                 2                 1                  2                 3
1                 2                 1                 2                  1                 2
2                 1                 4                 3                  1                 1
3                 3                 3                 1                  2                 2
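The desired output in the question numbers the categories in order of first appearance (Mortgage is 1, Credit reporting is 2, and so on) rather than alphabetically, which LabelEncoder does not do. One possible alternative, swapped in here as an assumption and not taken from the answers above, is `pandas.factorize`, which assigns codes in order of first appearance starting from 0. A minimal sketch:

```python
import pandas as pd

data11 = pd.DataFrame({
    'Transaction_Type': ['Mortgage', 'Credit reporting', 'Consumer Loan', 'Mortgage'],
    'Complaint_reason': ['Incorrect Info', 'False Statement', 'Using a Debit Card', 'Payoff process'],
    'Company_response': ['Response1', 'Response2', 'Response3', 'Response1'],
    'Consumer_disputes': ['Yes', 'No', 'No', 'Yes'],
    'Complaint_Status': ['Processing', 'Closed', 'Awaiting Response', 'Closed'],
})

encoded = data11.copy()
for col in encoded.columns:
    # factorize returns (codes, uniques); codes follow order of first appearance.
    codes, uniques = pd.factorize(encoded[col])
    # Shift from 0-based to 1-based codes.
    encoded[col] = codes + 1

print(encoded)
```

Unlike LabelEncoder, this mapping depends on row order, so the same category can receive a different code in another dataset. Keep the returned `uniques` if the mapping has to be reused.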