I want to encode a DataFrame that has several columns of the same "type", for example:
import pandas as pd
df = pd.DataFrame(data=[["France", "Bupapest", "Sweden", "Paris"], ["Italy", "Frankfurt", "France", "Naples"]], columns=["Countries 1", "Cities 1", "Countries 2", "Cities 2"])
print(df)
Output:
  Countries 1   Cities 1 Countries 2 Cities 2
0      France   Bupapest      Sweden    Paris
1       Italy  Frankfurt      France   Naples
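How can I one-hot encode this DataFrame by passing in the indices of the columns that should be treated as one? In this example I would pass [0, 2] and [1, 3]: the Countries 1 and Countries 2 columns together contain 3 distinct countries, so they should share 3 categories rather than get 2 each. The same principle applies to the two Cities columns.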
Answer 0 (score: 2)
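I flatten the df with wide_to_long, then use factorize + unstack. Note that this gives integer codes shared across each column group rather than one-hot columns: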
s=pd.wide_to_long(df.reset_index(),stubnames=['Countries','Cities'],i='index',j='unstack',sep=' ').apply(lambda x : pd.factorize(x)[0]+1).unstack()
s.columns=s.columns.map('{0[0]} {0[1]}'.format)
s=s.reindex(columns=df.columns)
s
Out[1377]:
       Countries 1  Cities 1  Countries 2  Cities 2
index
0                1         1            3         3
1                2         2            1         4
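If you would rather pass the column positions directly, as the question describes, the same shared-code idea can probably be sketched without wide_to_long. This is only a sketch: factorize_groups is a hypothetical helper name (not a pandas function), and it assumes the groups together cover every column of the frame.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["France", "Bupapest", "Sweden", "Paris"],
          ["Italy", "Frankfurt", "France", "Naples"]],
    columns=["Countries 1", "Cities 1", "Countries 2", "Cities 2"],
)

def factorize_groups(frame, groups):
    """Hypothetical helper: integer-encode `frame` so that all columns
    listed together in one group share a single code table."""
    out = pd.DataFrame(index=frame.index)
    for positions in groups:
        block = frame.iloc[:, positions]
        # Flatten column by column (Fortran order) so codes are assigned in
        # order of first appearance down the first column, then the next.
        flat = block.to_numpy().ravel(order="F")
        codes, _ = pd.factorize(flat)
        # Restore the block shape and start codes at 1, as in the answer.
        codes = codes.reshape(block.shape, order="F") + 1
        for k, col in enumerate(block.columns):
            out[col] = codes[:, k]
    # Put the columns back in their original order.
    return out[frame.columns]

shared_codes = factorize_groups(df, [[0, 2], [1, 3]])
print(shared_codes)
```

For the example frame this should give the same codes as the wide_to_long version, without relying on the "stub + separator + number" naming pattern of the columns.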
Or with get_dummies:
s=pd.get_dummies(pd.wide_to_long(df.reset_index(),stubnames=['Countries','Cities'],i='index',j='unstack',sep=' '))
s
Out[1392]:
               Countries_France  Countries_Italy  Countries_Sweden  \
index unstack
0     1                       1                0                 0
1     1                       0                1                 0
0     2                       0                0                 1
1     2                       1                0                 0

               Cities_Bupapest  Cities_Frankfurt  Cities_Naples  Cities_Paris
index unstack
0     1                      1                 0              0             0
1     1                      0                 1              0             0
0     2                      0                 0              0             1
1     2                      0                 0              1             0
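The get_dummies result above is in long form, with one row per (index, unstack) pair. If you need one row per original row, with each original column one-hot encoded against the categories shared by its group, something like the following might work. It is a sketch under assumptions: one_hot_groups is a hypothetical helper name, the values are assumed to be strings with no missing entries (sorted() would fail on NaN), and dtype=int is passed only to get 0/1 instead of booleans.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["France", "Bupapest", "Sweden", "Paris"],
          ["Italy", "Frankfurt", "France", "Naples"]],
    columns=["Countries 1", "Cities 1", "Countries 2", "Cities 2"],
)

def one_hot_groups(frame, groups):
    """Hypothetical helper: one-hot encode each column against the union of
    values found in all columns of its group."""
    pieces = []
    for positions in groups:
        block = frame.iloc[:, positions]
        # Shared category list for the whole group, sorted for a stable order.
        categories = sorted(pd.unique(block.to_numpy().ravel()))
        for col in block.columns:
            # A Categorical with explicit categories makes get_dummies emit a
            # column for every shared category, including ones this column
            # never contains.
            typed = pd.Series(
                pd.Categorical(block[col], categories=categories),
                index=frame.index,
            )
            pieces.append(pd.get_dummies(typed, prefix=col, dtype=int))
    return pd.concat(pieces, axis=1)

encoded = one_hot_groups(df, [[0, 2], [1, 3]])
print(encoded.shape)
print(encoded.filter(like="Countries"))
```

Each Countries column then has the same 3 indicator columns and each Cities column the same 4, which is what passing [0, 2] and [1, 3] in the question asks for.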