One-hot encoding more than one value per feature in categorical data

Time: 2018-02-28 17:14:09

Tags: python machine-learning scikit-learn data-science categorical-data

I am very new to scikit-learn, and right now I am struggling with the preprocessing stage.

I have the following categorical features (I parsed a JSON file and put the values into a dictionary):

dct['alcohol'] = ["Binge drinking",
  "Heavy drinking",
  "Moderate consumption",
  "Low consumption",
  "No consumption"]


dct['tobacco']= ["Current daily smoker - heavy",
  "Current daily smoker",
  "Current on-and-off smoker",
  "Former Smoker",
  "Never Smoked",
  "Snuff User"]

dct['onset'] = ["Gradual",
  "Sudden"]

My first approach was to convert the values to integers with a LabelEncoder and then apply one-hot encoding:

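import numpy as np
import sklearn.preprocessing

# n_values was deprecated in scikit-learn 0.20 and later removed in favor of categories
OH_enc = sklearn.preprocessing.OneHotEncoder(n_values=[len(dct['alcohol']),len(dct['tobacco']),len(dct['onset'])])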
le_alc = sklearn.preprocessing.LabelEncoder()
le_tobacco = sklearn.preprocessing.LabelEncoder()
le_onset = sklearn.preprocessing.LabelEncoder()

le_alc.fit(dct['alcohol'])
le_tobacco.fit(dct['tobacco'])
le_onset.fit(dct['onset'])


list_patient = []
list_patient.append(list(le_alc.transform(['Low consumption'])))
list_patient.append(list(le_tobacco.transform(['Former Smoker'])))
list_patient.append(list(le_onset.transform(['Sudden'])))

list1 = []
list1.append(np.array(list_patient).T[0][:])
list1.append([1,2,0])

OH_enc.fit(list1)
print(OH_enc.transform([[4,2,0]]).toarray())

So in the end, if you one-hot encode (4, 2, 0), you get:

[[0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]]

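This is what I want: the first 5 columns refer to the "alcohol" feature, the next 6 columns to "tobacco", and the last 2 columns to "onset".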
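For readers on a newer scikit-learn release, the same 5 + 6 + 2 layout can be sketched without the LabelEncoder step. This is only a sketch, assuming a version in which OneHotEncoder accepts the `categories` argument and string input (0.20 or later). The variable names `alcohol`, `tobacco`, `onset` and `patients` are introduced here for illustration. The columns follow the list order given below, not the alphabetical order that LabelEncoder produces, so the position of the 1s differs from the output above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Category lists copied from the question; their order defines the column order.
alcohol = ["Binge drinking", "Heavy drinking", "Moderate consumption",
           "Low consumption", "No consumption"]
tobacco = ["Current daily smoker - heavy", "Current daily smoker",
           "Current on-and-off smoker", "Former Smoker", "Never Smoked",
           "Snuff User"]
onset = ["Gradual", "Sudden"]

# One list of categories per input column (assumption: categories= is available).
encoder = OneHotEncoder(categories=[alcohol, tobacco, onset])

# One row per patient, one column per feature (illustrative sample).
patients = np.array([["Low consumption", "Former Smoker", "Sudden"]], dtype=object)

encoded = encoder.fit_transform(patients).toarray()
print(encoded)
```

Because the categories are passed explicitly, values that never occur in the fitted data still get their own column.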

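However, suppose a sample can have more than one value in a single feature. Say a sample has both "Binge drinking" and "Heavy drinking" for the alcohol feature. Then, if you one-hot encode ([0,1], 2, 0), you should get: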

[[1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0.]]

This last step is what I don't know how to do with sklearn.preprocessing.OneHotEncoder. In other words, how can I encode 2 values in a single feature for one sample?

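I know there may be better ways to encode "alcohol", "tobacco" and "onset", because they are ordinal values (each value in a feature is related to the other values of the same feature), so I could simply label them and then normalize. But let's assume they are categorical variables whose values are independent of each other.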
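The "label them and then normalize" alternative could look roughly like the sketch below. It assumes that the list order in the dictionary is the intended rank order, which is a guess and not something stated above. It also assumes a scikit-learn version that provides OrdinalEncoder (0.20 or later). The names `patients`, `ranks` and `scaled` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Category lists from the question; ASSUMPTION: list order is the rank order.
alcohol = ["Binge drinking", "Heavy drinking", "Moderate consumption",
           "Low consumption", "No consumption"]
tobacco = ["Current daily smoker - heavy", "Current daily smoker",
           "Current on-and-off smoker", "Former Smoker", "Never Smoked",
           "Snuff User"]
onset = ["Gradual", "Sudden"]

encoder = OrdinalEncoder(categories=[alcohol, tobacco, onset])

# Two illustrative patients, one column per feature.
patients = np.array([
    ["Low consumption", "Former Smoker", "Sudden"],
    ["Binge drinking", "Current daily smoker - heavy", "Gradual"],
], dtype=object)

# Each value becomes its position in the corresponding category list.
ranks = encoder.fit_transform(patients)

# Scale each column to [0, 1] by dividing by the largest possible rank.
sizes = np.array([len(alcohol), len(tobacco), len(onset)])
scaled = ranks / (sizes - 1)
print(scaled)
```

This representation cannot express two values in one feature for the same sample, which is the case the question is about.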

1 Answer:

Answer 0: (score: 2)

I finally solved it using MultiLabelBinarizer, as @VivekKumar suggested:

headings = dct['alcohol'] + dct['tobacco'] + dct['onset']

#print('my headings:'+ str(headings))

l1 = ['Heavy drinking, Low consumption, Former Smoker, Gradual', 'Low consumption, No consumption, Current on-and-off smoker, Sudden', 'Heavy drinking, Current on-and-off smoker']


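from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()  # pass sparse_output=True if you'd like
# Split each comma-separated string into its list of labels
dataMatrix = mlb.fit_transform(row.split(', ') for row in l1)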

print("My Classes: ")
print(mlb.classes_)
print(dataMatrix)
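A single binarizer fitted on all labels at once orders its columns alphabetically (see `mlb.classes_`) and only creates columns for labels that actually occur in the data. If the per-feature layout from the question (5 + 6 + 2 columns) matters, one option is to fit one MultiLabelBinarizer per feature with explicit `classes` and stack the results. This is a sketch and not a tested recipe. The `features` dictionary and the per-patient dictionary layout in `patients` are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Category lists from the question, keyed by feature name.
features = {
    "alcohol": ["Binge drinking", "Heavy drinking", "Moderate consumption",
                "Low consumption", "No consumption"],
    "tobacco": ["Current daily smoker - heavy", "Current daily smoker",
                "Current on-and-off smoker", "Former Smoker", "Never Smoked",
                "Snuff User"],
    "onset": ["Gradual", "Sudden"],
}

# Hypothetical sample layout: each patient maps a feature to a list of values.
patients = [
    {"alcohol": ["Binge drinking", "Heavy drinking"],
     "tobacco": ["Current on-and-off smoker"],
     "onset": ["Gradual"]},
]

blocks = []
for name, classes in features.items():
    # classes= fixes both the column order and the number of columns.
    mlb = MultiLabelBinarizer(classes=classes)
    blocks.append(mlb.fit_transform([p[name] for p in patients]))

# Concatenate the per-feature blocks side by side: 5 + 6 + 2 = 13 columns.
data_matrix = np.hstack(blocks)
print(data_matrix)
```

For this sample the row has the same pattern as the ([0,1], 2, 0) example in the question: two 1s in the alcohol block and one each in the tobacco and onset blocks.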