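I have two lists, features and labels. features contains disease, age, gender, and PIN. labels contains the health plan. The user supplies user_input in the same format as one row of features. The code should therefore use the DecisionTree API in sklearn to predict a health plan for the user.
A few of the values in features are strings, for example disease and gender. I am using LabelEncoder to encode them, to avoid the error "ValueError: could not convert string to float".
Now, after using LabelEncoder, I get the exception 'ValueError: bad input shape'.
How can I fix this, and how do I then reverse the encoding that was done to avoid the string-to-float error? Please help.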
from sklearn import tree
from sklearn.preprocessing import LabelEncoder

features = [['TB', 28, 'MALE', 121001], ['TB', 28, 'FEMALE', 121002], ['CANCER', 28, 'MALE', 121001], ['CANCER', 28, 'FEMALE', 121001]]
labels = ['X125434', 'X125436', 'X125437', 'X125437']
user_input = ['TB', 28, 'MALE', 121001]

le = LabelEncoder()
# The next line raises the ValueError: features is a 2-D list, but LabelEncoder expects 1-D input
new_features = le.fit_transform(features)
new_labels = le.fit_transform(labels)
new_ui = le.fit_transform(user_input)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(new_features, new_labels)
print(clf.predict([new_ui]))
Answer 0: (score: 3)
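Using the same label encoder for every feature in the dataset is not recommended. It is safer to create a separate label encoder for each column, because each feature has its own set of values.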
from sklearn import tree
from sklearn.preprocessing import LabelEncoder
import pandas as pd

features = [['TB', 28, 'MALE', 121001], ['TB', 28, 'FEMALE', 121002], ['CANCER', 28, 'MALE', 121001], ['CANCER', 28, 'FEMALE', 121001]]
labels = ['X125434', 'X125436', 'X125437', 'X125437']
feature_names = ['Disease', 'Age', 'Gender', 'PIN']
user_input = ['TB', 28, 'MALE', 121001]

# Training data, with the target stored in its own column
train = pd.DataFrame(data=features, columns=feature_names)
train['Labels'] = labels
# Single-row frame holding the user's input
test = pd.DataFrame([user_input], columns=feature_names)

# One encoder per string column, plus one for the target
le_disease = LabelEncoder()
le_gender = LabelEncoder()
le_labels = LabelEncoder()

# Fit the encoders on the training data
train['Disease'] = le_disease.fit_transform(train['Disease'])
train['Gender'] = le_gender.fit_transform(train['Gender'])
train['Labels'] = le_labels.fit_transform(train['Labels'])

# Reuse the fitted encoders on the test row (transform, not fit_transform)
test['Disease'] = le_disease.transform(test['Disease'])
test['Gender'] = le_gender.transform(test['Gender'])

clf = tree.DecisionTreeClassifier()
clf = clf.fit(train[feature_names], train['Labels'])
# Decode the predicted class back to the original health plan ID
print(le_labels.inverse_transform(clf.predict(test[feature_names])))
LabelEncoder.inverse_transform() can be used to get the original values back.
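As a minimal sketch of that round trip, using only the health plan IDs from the question (the variable names plans, encoded, and decoded are illustrative, not from the original code):

```python
from sklearn.preprocessing import LabelEncoder

plans = ['X125434', 'X125436', 'X125437', 'X125437']

le_labels = LabelEncoder()
encoded = le_labels.fit_transform(plans)        # one integer code per plan
decoded = le_labels.inverse_transform(encoded)  # back to the original strings

print(list(encoded))
print(list(decoded))
```

The same fitted encoder has to be used for both directions, since the mapping is stored on the encoder object.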
Answer 1: (score: 2)
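According to the LabelEncoder documentation, it looks like you are using it incorrectly, so the exception you are getting is exactly right.
In your case, I think you want to encode Diseases, Gender, and Health-Plan as integers: for example, TB and CANCER will be mapped to 0 and 1, and MALE and FEMALE will be mapped to 0 and 1 as well (LabelEncoder assigns codes in sorted order, so CANCER is 0, TB is 1, FEMALE is 0, and MALE is 1); X125434, X125436, X125437 will be encoded as 0, 1, 2.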
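A small sketch may make both points concrete: LabelEncoder works on one 1-D sequence at a time, and the integer codes follow the sorted order of the values. The names mapping and raised_for_2d are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

le_diseases = LabelEncoder()
le_diseases.fit(['TB', 'TB', 'CANCER', 'CANCER'])

# classes_ holds the sorted unique values; each value's index is its integer code
mapping = {str(value): code for code, value in enumerate(le_diseases.classes_)}
print(mapping)

# Passing whole rows (a 2-D list) is what triggers the ValueError from the question
rows = [['TB', 28, 'MALE', 121001], ['CANCER', 28, 'FEMALE', 121001]]
raised_for_2d = False
try:
    LabelEncoder().fit_transform(rows)
except ValueError as exc:
    raised_for_2d = True
    print('ValueError:', exc)
```

The exact wording of the error message differs between scikit-learn versions, so the sketch prints whatever message is raised rather than assuming one.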
Example:
from sklearn import tree
from sklearn.preprocessing import LabelEncoder

features = [
    ['TB', 28, 'MALE', 121001],
    ['TB', 28, 'FEMALE', 121002],
    ['CANCER', 28, 'MALE', 121001],
    ['CANCER', 28, 'FEMALE', 121001]]
labels = ['X125434', 'X125436', 'X125437', 'X125437']
user_input = ['TB', 28, 'MALE', 121001]

# use different encoders for different data
le = LabelEncoder()
le_diseases = LabelEncoder()
le_gender = LabelEncoder()

diseases = [features_list[0] for features_list in features]
gender = [features_list[2] for features_list in features]

features_preprocessed = []
diseases_labels = le_diseases.fit_transform(diseases)
gender_labels = le_gender.fit_transform(gender)
for i, features_list in enumerate(features):
    features_preprocessed.append([
        diseases_labels[i],
        features_list[1],
        gender_labels[i],
        features_list[3]])

labels_preprocessed = le.fit_transform(labels)
# ... then use features_preprocessed, labels_preprocessed and the label encoders above
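One possible way to carry on from that last comment is sketched below. It is an assumption about how the remaining steps could look, not the answerer's own code: the names user_preprocessed and predicted_plan are made up for illustration, and random_state=0 is added only to make the run repeatable.

```python
from sklearn import tree
from sklearn.preprocessing import LabelEncoder

features = [
    ['TB', 28, 'MALE', 121001],
    ['TB', 28, 'FEMALE', 121002],
    ['CANCER', 28, 'MALE', 121001],
    ['CANCER', 28, 'FEMALE', 121001]]
labels = ['X125434', 'X125436', 'X125437', 'X125437']
user_input = ['TB', 28, 'MALE', 121001]

le = LabelEncoder()
le_diseases = LabelEncoder()
le_gender = LabelEncoder()

# Same preprocessing as above, written compactly
diseases_labels = le_diseases.fit_transform([row[0] for row in features])
gender_labels = le_gender.fit_transform([row[2] for row in features])
features_preprocessed = [
    [diseases_labels[i], row[1], gender_labels[i], row[3]]
    for i, row in enumerate(features)]
labels_preprocessed = le.fit_transform(labels)

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(features_preprocessed, labels_preprocessed)

# Encode the user's row with the already-fitted encoders (transform, not fit_transform)
user_preprocessed = [
    le_diseases.transform([user_input[0]])[0],
    user_input[1],
    le_gender.transform([user_input[2]])[0],
    user_input[3]]

# Predict, then decode the integer class back to the original plan ID
predicted_plan = le.inverse_transform(clf.predict([user_preprocessed]))[0]
print(predicted_plan)
```

Calling fit_transform again on the user's row would build a fresh mapping from that single row, so the codes would no longer match the ones the classifier was trained on.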
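P.S. I suggest using a pandas DataFrame instead of lists: as the example above shows, working with lists in this case is not very clean. Your features would look like this: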
import pandas as pd

features_df = pd.DataFrame({
    'Diseases': ['TB', 'TB', 'CANCER', 'CANCER'],
    'Age': [28, 28, 28, 28],
    'Gender': ['MALE', 'FEMALE', 'MALE', 'FEMALE'],
    'PIN': [121001, 121002, 121001, 121001]
})
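With the data in a DataFrame, the string columns can be encoded in a short loop. This is a rough sketch, assuming one LabelEncoder per column as recommended above; the encoders dictionary is a hypothetical helper for keeping the fitted encoders around.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

features_df = pd.DataFrame({
    'Diseases': ['TB', 'TB', 'CANCER', 'CANCER'],
    'Age': [28, 28, 28, 28],
    'Gender': ['MALE', 'FEMALE', 'MALE', 'FEMALE'],
    'PIN': [121001, 121002, 121001, 121001]
})

# One fitted encoder per string column, kept so the same mapping can be reused later
encoders = {}
for column in ['Diseases', 'Gender']:
    encoders[column] = LabelEncoder()
    features_df[column] = encoders[column].fit_transform(features_df[column])

print(features_df)
```

Keeping the encoders in a dictionary means a new row can later be passed through encoders[column].transform, and predictions can be decoded with inverse_transform.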