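One-hot encoding - encoding multiple columns as one

Time: 2018-02-07 16:52:14

Tags: python pandas

I want to encode a dataframe that has multiple columns of the same "type", for example: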

import pandas as pd

df = pd.DataFrame(data=[["France", "Bupapest", "Sweden", "Paris"], ["Italy", "Frankfurt", "France", "Naples"]], columns=["Countries 1", "Cities 1", "Countries 2", "Cities 2"])
print(df)

Output:

  Countries 1   Cities 1 Countries 2 Cities 2
0      France   Bupapest      Sweden    Paris
1       Italy  Frankfurt      France   Naples

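How can I one-hot encode this dataframe by passing in the indices of the columns that should be treated as one? In this example I would pass [0, 2] and [1, 3]: the Countries 1 and Countries 2 columns together contain 3 distinct countries, so there should be 3 categories rather than 2 per column. The same principle applies to the two Cities columns.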

1 answer:

Answer 0 (score: 2)

I flatten the df with wide_to_long, then use factorize + unstack:

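# Stack same-type columns in long form, factorize each stacked column (codes start at 1), pivot back
long_df = pd.wide_to_long(df.reset_index(), stubnames=['Countries', 'Cities'], i='index', j='unstack', sep=' ')
s = long_df.apply(lambda x: pd.factorize(x)[0] + 1).unstack()

# Flatten the (stub, suffix) column MultiIndex back to names such as "Countries 1"
s.columns = s.columns.map('{0[0]} {0[1]}'.format)

s = s.reindex(columns=df.columns)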
s
Out[1377]: 
       Countries 1  Cities 1  Countries 2  Cities 2
index                                              
0                1         1            3         3
1                2         2            1         4
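
The question asks for a way to pass column positions such as [0, 2] and [1, 3] directly. A minimal sketch of that idea follows, assuming each group of positions should share one code table. The helper name `factorize_groups` is hypothetical and not part of pandas. It flattens each group column by column, so the codes come out in the same order as in the wide_to_long version above.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["France", "Bupapest", "Sweden", "Paris"],
          ["Italy", "Frankfurt", "France", "Naples"]],
    columns=["Countries 1", "Cities 1", "Countries 2", "Cities 2"],
)


def factorize_groups(frame, groups):
    """Label-encode `frame` so each group of column positions shares its codes.

    Hypothetical helper: `groups` is a list of lists of column positions.
    """
    encoded = {}
    for positions in groups:
        block = frame.iloc[:, positions]
        # Flatten column by column (Fortran order) so the first column's
        # values receive the lowest codes.
        flat = block.to_numpy().ravel(order="F")
        codes, _ = pd.factorize(flat)
        # Shift codes to start at 1 and restore the (rows, columns) shape.
        codes = (codes + 1).reshape(block.shape, order="F")
        for k, col in enumerate(block.columns):
            encoded[col] = codes[:, k]
    # Keep the original column order; columns in no group are dropped.
    ordered = [c for c in frame.columns if c in encoded]
    return pd.DataFrame(encoded, index=frame.index)[ordered]


encoded = factorize_groups(df, [[0, 2], [1, 3]])
print(encoded)
```

France gets the same code in Countries 1 and Countries 2 because both columns are factorized together.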

Or with get_dummies (on pandas 2.0 and later, pass dtype=int to get 0/1 instead of True/False):

s = pd.get_dummies(pd.wide_to_long(df.reset_index(), stubnames=['Countries', 'Cities'], i='index', j='unstack', sep=' '))

s
Out[1392]: 
               Countries_France  Countries_Italy  Countries_Sweden  \
index unstack                                                        
0     1                       1                0                 0   
1     1                       0                1                 0   
0     2                       0                0                 1   
1     2                       1                0                 0   
               Cities_Bupapest  Cities_Frankfurt  Cities_Naples  Cities_Paris  
index unstack                                                                  
0     1                      1                 0              0             0  
1     1                      0                 1              0             0  
0     2                      0                 0              0             1  
1     2                      0                 0              1             0
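
The long-form result above has one row per (index, unstack) pair. If one row per original row is wanted instead, a possible sketch is to give every column in a group the same categorical dtype before calling get_dummies, so each column expands to the full shared category set. The helper name `one_hot_groups` is hypothetical, and the sketch assumes the grouped columns hold sortable values with no missing entries.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["France", "Bupapest", "Sweden", "Paris"],
          ["Italy", "Frankfurt", "France", "Naples"]],
    columns=["Countries 1", "Cities 1", "Countries 2", "Cities 2"],
)


def one_hot_groups(frame, groups):
    """One-hot encode `frame`; columns in the same group share categories.

    Hypothetical helper: `groups` is a list of lists of column positions.
    """
    pieces = []
    for positions in groups:
        block = frame.iloc[:, positions]
        # Union of the values seen in any column of the group,
        # sorted so the dummy columns come out in a stable order.
        categories = sorted(pd.unique(block.to_numpy().ravel()))
        shared = pd.CategoricalDtype(categories=categories)
        for col in block.columns:
            # A categorical column yields one dummy per category,
            # including categories that never occur in this column.
            dummies = pd.get_dummies(block[col].astype(shared), prefix=col, dtype=int)
            pieces.append(dummies)
    return pd.concat(pieces, axis=1)


one_hot = one_hot_groups(df, [[0, 2], [1, 3]])
print(one_hot)
```

Each Countries column now expands to three indicator columns (France, Italy, Sweden) and each Cities column to four, regardless of which values occur in that particular column.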