Suppose I have the following dataset (2 rows, 2 columns, with headers Char0 and Char1):
import pandas as pd

dataset = [['A', 'B'], ['B', 'C']]
columns = ['Char0', 'Char1']
df = pd.DataFrame(dataset, columns=columns)
I want to one-hot encode the Char0 and Char1 columns, so:
df = pd.concat([df, pd.get_dummies(df["Char0"], prefix='Char0')], axis=1)
df = pd.concat([df, pd.get_dummies(df["Char1"], prefix='Char1')], axis=1)
df.drop(['Char0', "Char1"], axis=1, inplace=True)
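This produces a DataFrame with the column headers Char0_A, Char0_B, Char1_B, Char1_C.
Now I would like each column to have indicators for A, B, C and D (even though "D" does not appear in the current dataset). In this case that means 8 columns: Char0_A, Char0_B, Char0_C, Char0_D, Char1_A, Char1_B, Char1_C, Char1_D.
Can anyone help me with this?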
Answer 0 (score: 2)
Use get_dummies on all columns, then call DataFrame.reindex with every possible column name, built with itertools.product:
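import pandas as pd
from itertools import product

dataset = [['A', 'B'], ['B', 'C']]
columns = ['Char0', 'Char1']
df = pd.DataFrame(dataset, columns=columns)
# Every category that should get a column, even if absent from the data
vals = ['A', 'B', 'C', 'D']
# Full list of expected column names: Char0_A ... Char1_D
cols = ['_'.join(x) for x in product(df.columns, vals)]
print(cols)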
['Char0_A', 'Char0_B', 'Char0_C', 'Char0_D', 'Char1_A', 'Char1_B', 'Char1_C', 'Char1_D']
# dtype=int keeps the 0/1 output on pandas 2.0+, where get_dummies defaults to bool
df1 = pd.get_dummies(df, dtype=int).reindex(cols, axis=1, fill_value=0)
print(df1)
   Char0_A  Char0_B  Char0_C  Char0_D  Char1_A  Char1_B  Char1_C  Char1_D
0        1        0        0        0        0        1        0        0
1        0        1        0        0        0        0        1        0
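As a possible alternative to reindex (not part of the original answer), the columns can be declared categorical with the full set of allowed values, in which case get_dummies emits a column for every category, including unobserved ones such as D. This is only a sketch: the names cat_type and encoded are made up for illustration, and dtype=int is passed on the assumption that 0/1 integers are wanted instead of the booleans that newer pandas versions return by default.

```python
import pandas as pd

dataset = [['A', 'B'], ['B', 'C']]
columns = ['Char0', 'Char1']
df = pd.DataFrame(dataset, columns=columns)

# All values each column is allowed to take, including the unseen 'D'
vals = ['A', 'B', 'C', 'D']

# Hypothetical helper name: a categorical dtype carrying the full category list
cat_type = pd.CategoricalDtype(categories=vals)

# Cast every column to that dtype, then encode; get_dummies creates
# one indicator column per declared category, observed or not
encoded = pd.get_dummies(df.astype(cat_type), dtype=int)
print(encoded.columns.tolist())
```

This keeps the list of valid categories attached to the data itself, which may be convenient when the same encoding has to be applied to several DataFrames.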