Question

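I would like to use the pandas get_dummies() function to encode a (fairly large) set of categorical variables. However, the data is currently in a nested (long) table format: each row holds one more value of the variable for a given instance, e.g.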

Instance, Cat_Col
1, John
1, Smith
2, Jane
3, Joe

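I can already generate the full list of unique values, which I can use to tell get_dummies about every possible value. However, converting the nested table into a single row per instance in this new format is giving me some trouble.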

Any help is much appreciated. Thanks.

Edit: each instance should have a dummy-encoded result covering all values of Cat_Col.

The idea is to end up with a single feature vector per instance, like this:

Instance,Col_John,Col_Smith,Col_Jane,Col_Joe
1,1,1,0,0
2,0,0,1,0
3,0,0,0,1

I believe this is the correct encoding, assuming we are doing one-hot encoding.

Answer 1

You may want to consider using pivot_table to achieve this.

import pandas as pd

df

Out[10]: 
   Instance Cat_Col
0         1    John
1         1   Smith
2         2    Jane
3         3     Joe

df['count'] = 1
# Recent pandas versions require keyword arguments for pivot
df.pivot(index='Instance', columns='Cat_Col', values='count').fillna(0).astype(int)

Out[11]: 
Cat_Col    Jane   Joe   John   Smith
Instance                            
1             0     0      1       1
2             1     0      0       0
3             0     1      0       0
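For reference, here is a self-contained sketch of the same idea using pivot_table itself rather than pivot. It assumes the sample data from the question; the names `counts` and `dummies` are only illustrative. Unlike pivot, pivot_table aggregates, so it should also cope with an (Instance, Cat_Col) pair that appears more than once.

```python
import pandas as pd

# Sample data from the question, in the nested (long) format.
df = pd.DataFrame({
    "Instance": [1, 1, 2, 3],
    "Cat_Col": ["John", "Smith", "Jane", "Joe"],
})

# Helper column of ones; aggregating with "max" keeps the result at 0/1
# even if the same (Instance, Cat_Col) pair is repeated.
counts = df.assign(count=1).pivot_table(
    index="Instance",
    columns="Cat_Col",
    values="count",
    aggfunc="max",
    fill_value=0,
)

# Make the indicator dtype explicit.
dummies = counts.astype(int)
print(dummies)
```

The columns come out in sorted order (Jane, Joe, John, Smith), as in the output above.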

If you prefer to use get_dummies,

result = pd.get_dummies(df.Cat_Col, dtype=int)
result['Instance'] = df.Instance
result = result.set_index('Instance')
# Take the per-instance maximum so each instance collapses to one 0/1 row
result.groupby(level=0).max()

Out[26]: 
           Jane   Joe   John   Smith
Instance                            
1             0     0      1       1
2             1     0      0       0
3             0     1      0       0
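If the exact layout from the question is wanted (an Instance column followed by Col_-prefixed columns in order of first appearance), something along these lines might work. It is a sketch that assumes the sample data from the question; `encoded` and `order` are illustrative names, and the "Col" prefix is taken from the desired output shown in the question.

```python
import pandas as pd

# Sample data from the question, in the nested (long) format.
df = pd.DataFrame({
    "Instance": [1, 1, 2, 3],
    "Cat_Col": ["John", "Smith", "Jane", "Joe"],
})

# One-hot encode every row, then collapse the rows that share an Instance
# by taking the column-wise maximum (1 if the value occurs at all).
encoded = (
    pd.get_dummies(df["Cat_Col"], prefix="Col", dtype=int)
    .groupby(df["Instance"])
    .max()
)

# get_dummies sorts the categories; restore the order of first appearance.
order = ["Col_" + name for name in df["Cat_Col"].unique()]
encoded = encoded[order].reset_index()
print(encoded)
```

Grouping by `df["Instance"]` directly avoids the temporary set_index step, since the dummy rows keep the same row index as `df`.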

Pandas get_dummies for nested tables
