I have the following data frame, which records features for different subjects:
ID Feature
-------------------------
1 A
1 B
2 A
1 A
3 B
3 B
1 C
2 C
3 D
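I would like to obtain another (aggregated?) data frame in which each row represents one subject and the columns form an exhaustive list of all one-hot encoded features:
ID FEATURE_A FEATURE_B FEATURE_C FEATURE_D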
--------------------------------------------
1 1 1 1 0
2 1 0 1 0
3 0 1 0 1
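How can this be done in Python (Pandas)?
Bonus: how can I get a version where the feature columns contain the number of occurrences rather than just binary flags?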
Answer 0: (score: 3)
Use join with get_dummies, then groupby and aggregate with max:
out = df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').max()
print(out)
FEATURE_A FEATURE_B FEATURE_C FEATURE_D
ID
1 1 1 1 0
2 1 0 1 0
3 0 1 0 1
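For readers who want to run this end to end, here is a minimal self-contained sketch of the same join + get_dummies + groupby/max chain on the sample data from the question. The variable names `dummies` and `flags` are mine, and `dtype=int` is an addition not present in the answer: recent pandas versions return boolean indicators by default, so it is assumed here to keep the 0/1 output shown above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID":      [1, 1, 2, 1, 3, 3, 1, 2, 3],
    "Feature": ["A", "B", "A", "A", "B", "B", "C", "C", "D"],
})

# dtype=int keeps the indicators as 0/1 integers instead of booleans.
dummies = pd.get_dummies(df["Feature"], dtype=int).add_prefix("FEATURE_")

# Attach the indicators to the IDs, then keep the per-ID maximum of each flag.
flags = df[["ID"]].join(dummies).groupby("ID").max()
print(flags)
```

Taking the maximum per group turns repeated occurrences (such as the two A rows for ID 1) into a single flag.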
Details:
print (pd.get_dummies(df['Feature']))
A B C D
0 1 0 0 0
1 0 1 0 0
2 1 0 0 0
3 1 0 0 0
4 0 1 0 0
5 0 1 0 0
6 0 0 1 0
7 0 0 1 0
8 0 0 0 1
Another solution using MultiLabelBinarizer and the DataFrame constructor:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['Feature']),
                   columns=['FEATURE_' + x for x in mlb.classes_],
                   index=df.ID).max(level=0)
print (df1)
FEATURE_A FEATURE_B FEATURE_C FEATURE_D
ID
1 1 1 1 0
2 1 0 1 0
3 0 1 0 1
Timings:
np.random.seed(123)
N = 100000
L = list('abcdefghijklmno'.upper())
df = pd.DataFrame({'Feature': np.random.choice(L, N),
'ID':np.random.randint(10000,size=N)})
def jez(df):
    mlb = MultiLabelBinarizer()
    return pd.DataFrame(mlb.fit_transform(df['Feature']),
                        columns=['FEATURE_' + x for x in mlb.classes_],
                        index=df.ID).max(level=0)
#jez1
In [464]: %timeit (df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').max())
10 loops, best of 3: 39.3 ms per loop
In [465]: %timeit (jez(df))
10 loops, best of 3: 138 ms per loop
#Scott Boston1
In [466]: %timeit (df.set_index('ID')['Feature'].str.get_dummies().add_prefix('FEATURE_').max(level=0))
1 loop, best of 3: 1.03 s per loop
#wen1
In [467]: %timeit (pd.crosstab(df.ID,df.Feature).gt(0).astype(int).add_prefix('FEATURE '))
1 loop, best of 3: 383 ms per loop
#wen2
In [468]: %timeit (pd.get_dummies(df.drop_duplicates().set_index('ID')).sum(level=0))
10 loops, best of 3: 47 ms per loop
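Caveat: these results do not account for how performance changes with the proportion of distinct Feature and ID values, which strongly affects the timings of some solutions.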
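One way to probe that caveat is to time a solution on frames with the same number of rows but different numbers of distinct IDs. The sketch below does this for the join + get_dummies approach only. The helper names `one_hot_by_id` and `make_frame`, the row count, and the two ID counts are all hypothetical choices for illustration, and `dtype=int` is assumed so the flags stay integers. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def one_hot_by_id(df):
    """join + get_dummies + groupby/max, as in the first solution."""
    dummies = pd.get_dummies(df["Feature"], dtype=int).add_prefix("FEATURE_")
    return df[["ID"]].join(dummies).groupby("ID").max()


def make_frame(n_rows, n_ids, seed=123):
    """Random frame with 15 possible features and at most n_ids distinct IDs."""
    rng = np.random.default_rng(seed)
    features = list("ABCDEFGHIJKLMNO")
    return pd.DataFrame({
        "Feature": rng.choice(features, n_rows),
        "ID": rng.integers(n_ids, size=n_rows),
    })


# Hypothetical settings: same row count, few vs. many distinct IDs.
timings = {}
for n_ids in (100, 10_000):
    frame = make_frame(20_000, n_ids)
    runs = timeit.repeat(lambda: one_hot_by_id(frame), number=3, repeat=3)
    timings[n_ids] = min(runs)  # best of three repeats, each doing 3 calls
    print(n_ids, "IDs:", round(timings[n_ids], 4), "s for 3 calls")
```

Swapping another solution into `one_hot_by_id` would allow a like-for-like comparison under the same settings.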
Answer 1: (score: 2)
Another similar option is to use set_index, the .str (string accessor) get_dummies, and max with the level=0 parameter, with add_prefix to rename the columns:
df.set_index('ID')['Feature'].str.get_dummies().add_prefix('FEATURE_').max(level=0)
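Note that the level argument of max was removed in pandas 2.0, so the one-liner above only runs on older versions. A minimal sketch of an equivalent for newer versions, assuming the sample data from the question, groups on the index level instead (the names `dummies` and `result` are mine):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID":      [1, 1, 2, 1, 3, 3, 1, 2, 3],
    "Feature": ["A", "B", "A", "A", "B", "B", "C", "C", "D"],
})

# One indicator column per distinct feature string, indexed by ID.
dummies = df.set_index("ID")["Feature"].str.get_dummies().add_prefix("FEATURE_")

# groupby(level=0) groups rows by the ID index, standing in for max(level=0).
result = dummies.groupby(level=0).max()
print(result)
```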
Answer 2: (score: 1)
Using pd.crosstab:
pd.crosstab(df.ID,df.Feature).gt(0).astype(int).add_prefix('FEATURE ')
Out[805]:
Feature FEATURE A FEATURE B FEATURE C FEATURE D
ID
1 1 1 1 0
2 1 0 1 0
3 0 1 0 1
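The output above carries two cosmetic differences from the table requested in the question: the prefix ends in a space rather than an underscore, and the column index keeps the name Feature. A sketch that addresses both, assuming the sample data from the question (the underscore prefix and the rename_axis call are my additions, not part of the answer):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID":      [1, 1, 2, 1, 3, 3, 1, 2, 3],
    "Feature": ["A", "B", "A", "A", "B", "B", "C", "C", "D"],
})

flags = (
    pd.crosstab(df["ID"], df["Feature"])  # occurrence counts per (ID, Feature)
    .gt(0)                                # True where the feature occurs at all
    .astype(int)                          # back to 0/1 flags
    .add_prefix("FEATURE_")               # underscore, as in the requested header
    .rename_axis(columns=None)            # drop the "Feature" label on the columns
)
print(flags)
```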
Or use drop_duplicates followed by get_dummies:
pd.get_dummies(df.drop_duplicates().set_index('ID')).sum(level=0)
Out[808]:
Feature_A Feature_B Feature_C Feature_D
ID
1 1 1 1 0
2 1 0 1 0
3 0 1 0 1
Bonus answer: how to get a version where the feature columns contain the number of occurrences rather than just binary flags?
Option 1
pd.crosstab(df.ID,df.Feature)
Out[809]:
Feature A B C D
ID
1 2 1 1 0
2 1 0 1 0
3 0 2 0 1
or
Option 2
pd.get_dummies(df.set_index('ID')).sum(level=0)
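Both options can be checked against each other on the sample data. The sketch below is written for recent pandas versions, so it makes two assumptions that are not in the answer: `dtype=int` is passed to get_dummies so the indicators can be summed as integers, and `groupby(level=0).sum()` stands in for `sum(level=0)`, which pandas 2.0 removed. The variable names are mine.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID":      [1, 1, 2, 1, 3, 3, 1, 2, 3],
    "Feature": ["A", "B", "A", "A", "B", "B", "C", "C", "D"],
})

# Option 1: crosstab counts every (ID, Feature) pair directly.
counts_crosstab = pd.crosstab(df["ID"], df["Feature"])

# Option 2: sum the indicators per ID instead of taking the maximum.
counts_dummies = (
    pd.get_dummies(df.set_index("ID"), dtype=int)
    .groupby(level=0)
    .sum()
)

print(counts_crosstab)
print(counts_dummies)
```

The two results hold the same counts and differ only in column labels: A to D for crosstab, Feature_A to Feature_D for get_dummies.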