Question

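So I'm working on a project that requires one-hot encoding for one part, but I have no idea how to use it. I've been using Google to try to understand it, but I just can't. My question is below.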

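Now we also want to use the categorical features! To do that, we have to perform OneHotEncoding on them. Each categorical feature in the feature table should be replaced with dummy columns (one column per possible value of the categorical feature) and then encoded in a binary manner, so that at most one of the dummy columns can take "1" at a time (the rest are 0). For example, "Gender" can take two values, "m" and "f". We therefore replace this feature (in the feature table) with two columns titled "m" and "f". Wherever we have a male subject, we put "1" and "0" in columns "m" and "f". Wherever we have a female subject, we put "0" and "1" in columns "m" and "f". (Hint: you will need 4 columns to encode "ChestPain" and 3 columns to encode "Thal".)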
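To make the "m"/"f" description above concrete, here is a minimal sketch using pandas' get_dummies. The toy DataFrame and its four rows are made up for illustration; only the column name "Gender" and the values "m" and "f" come from the assignment text.

```python
import pandas as pd

# Toy data (assumption): four subjects, values follow the "Gender" example.
subjects = pd.DataFrame({"Gender": ["m", "f", "f", "m"]})

# One dummy column per distinct value; dtype=int yields 0/1 instead of booleans.
gender_dummies = pd.get_dummies(subjects["Gender"], dtype=int)

# Put "m" first to match the column order used in the description.
gender_dummies = gender_dummies[["m", "f"]]
print(gender_dummies)
```

Each row has exactly one "1": a male subject becomes (1, 0) and a female subject becomes (0, 1).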

So far, my code is this:

import pandas as pd
from sklearn.model_selection import train_test_split

# a- Read the dataset from the following URL:
# and assign it to a Pandas DataFrame

heart_d = pd.read_csv("C:/Users/Michael/Desktop/HW2/Heart_s.csv")

# Numeric feature columns and the target column
feature_cols = ['Age','RestBP','Chol','RestECG','MaxHR','Oldpeak']
X = heart_d[feature_cols]

y = heart_d['AHD']

# Randomly splitting the original dataset into training set and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

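This works so far, but now I have to apply one-hot encoding to the categorical features, and I'm completely lost on how it works. The 3 categorical features in the dataset are (Gender, ChestPain, Thal). I tried this: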

df_cp = pd.get_dummies(heart_d['ChestPain'])
df_g = pd.get_dummies(heart_d['Gender'])
df_t = pd.get_dummies(heart_d['Thal'])

df_new = pd.concat([df, df_cp,df_g,df_t ], axis=1)

But I'm not sure whether that worked, because when I run my classification I get the same answer.
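One possible reason for the unchanged result (a guess, since the classifier code is not shown): X is still built from feature_cols only, so the dummy columns in df_new never reach the model. Note also that `df` is not defined in the code shown, so the concat presumably meant heart_d. A minimal sketch of building X from both the numeric and the encoded columns follows. The small DataFrame stands in for Heart_s.csv: the column names come from the question, but every value, including the category labels for ChestPain and Thal, is an assumption for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for Heart_s.csv: column names are from the question,
# all values (including the category labels) are made up.
heart_d = pd.DataFrame({
    "Age": [63, 67, 41, 56, 57, 62, 53, 44],
    "RestBP": [145, 160, 130, 120, 140, 138, 128, 118],
    "Chol": [233, 286, 204, 236, 192, 294, 216, 242],
    "Gender": ["m", "m", "f", "m", "f", "f", "m", "f"],
    "ChestPain": ["typical", "asymptomatic", "nontypical", "nonanginal",
                  "asymptomatic", "typical", "nonanginal", "nontypical"],
    "Thal": ["fixed", "normal", "normal", "reversable",
             "normal", "reversable", "fixed", "normal"],
    "AHD": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "No"],
})

numeric_cols = ["Age", "RestBP", "Chol"]
categorical_cols = ["Gender", "ChestPain", "Thal"]

# Encode all categorical columns at once; names become e.g. "Thal_normal".
dummies = pd.get_dummies(heart_d[categorical_cols], columns=categorical_cols, dtype=int)

# X must include the dummy columns, otherwise the model never sees them.
X = pd.concat([heart_d[numeric_cols], dummies], axis=1)
y = heart_d["AHD"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3
)
print(X.columns.tolist())
```

With 2 values for Gender, 4 for ChestPain and 3 for Thal, this gives 9 dummy columns next to the numeric ones, which matches the hint in the assignment.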

Answer 1

I guess you could use scikit-learn for training on the data; here is the one-hot encoder example from it:

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
   handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])
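The n_values_ and feature_indices_ attributes in the session above belong to older scikit-learn releases. As far as I know, recent releases expose the learned values through categories_ instead, so the same example can be sketched like this (same input data as above, nothing else assumed):

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# categories_ lists the distinct values learned for each input column.
print(enc.categories_)

# Same sample as in the example above: 2 + 3 + 4 = 9 output columns.
encoded = enc.transform([[0, 1, 1]]).toarray()
print(encoded)
```

The encoded row is the same as before: the first column contributes 2 slots, the second 3 and the third 4, with a single 1 in each group.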

==== Update ====

I wrote a detailed example of how to one-hot encode string attributes using DictVectorizer:

import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV

# Training data with two string-valued attributes
d = [
    {'country':'A', 'Gender':'M'},
    {'country':'B', 'Gender':'F'},
    {'country':'C', 'Gender':'F'}
]
df = pd.DataFrame(d)
print(df)

# Test data; it contains no 'M' and no country 'C'
test_d = [
    {'country':'A', 'Gender':'F'},
    {'country':'B', 'Gender':'F'}
]
test_df = pd.DataFrame(test_d)
print(test_df)

# DictVectorizer expects one dict per row
train_x = df.to_dict(orient='records')
vx = DV(sparse=False)

# Learn the feature names (Gender=F, Gender=M, country=A, ...) and encode
transform_x = vx.fit_transform(train_x)
print('transform_train_df')
print(transform_x)

# Reuse the fitted vectorizer so the test columns match the training columns
test_x = test_df.to_dict(orient='records')
transform_test_x = vx.transform(test_x)
print('transform_test_df')
print(transform_test_x)

Output:

  country Gender
0       A      M
1       B      F
2       C      F
  country Gender
0       A      F
1       B      F
transform_train_df
[[0. 1. 1. 0. 0.]
 [1. 0. 0. 1. 0.]
 [1. 0. 0. 0. 1.]]
transform_test_df
[[1. 0. 1. 0. 0.]
 [1. 0. 0. 1. 0.]]
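The DictVectorizer example above keeps the test columns aligned with the training columns even though the test data lacks 'M' and country 'C'. If you would rather stay with pd.get_dummies, as in the question, one way to get the same alignment is to reindex the test dummies against the training columns. This is only a sketch on the same toy data, not something the assignment prescribes:

```python
import pandas as pd

train_df = pd.DataFrame({"country": ["A", "B", "C"], "Gender": ["M", "F", "F"]})
test_df = pd.DataFrame({"country": ["A", "B"], "Gender": ["F", "F"]})

# Encode the training data; the column names look like "country_A", "Gender_F".
train_x = pd.get_dummies(train_df, dtype=int)

# The test data has no "country_C" or "Gender_M" column after encoding,
# so reindex adds them (filled with 0) and enforces the training order.
test_x = pd.get_dummies(test_df, dtype=int).reindex(
    columns=train_x.columns, fill_value=0
)

print(train_x.columns.tolist())
print(test_x)
```

Without the reindex step, the test matrix would have fewer columns than the training matrix, and a model fitted on the training data could not be applied to it.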
