One-hot encoding multi-level column data

Time: 2017-10-17 13:40:48

Tags: python pandas encoding

I have the following data frame, which contains records of features observed for different subjects:

ID   Feature
-------------------------
1    A
1    B
2    A
1    A
3    B
3    B
1    C
2    C
3    D
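
For readers who want to reproduce the answers, the table above can be built as a DataFrame roughly like this. This is only a sketch: the variable name `df` matches what the answers use, and the integer `ID` and string `Feature` dtypes are assumptions, since the question does not state them.

```python
import pandas as pd

# Example data copied from the question: one row per (ID, Feature) record.
# Integer IDs and string features are assumed; the question does not say.
df = pd.DataFrame({
    'ID': [1, 1, 2, 1, 3, 3, 1, 2, 3],
    'Feature': ['A', 'B', 'A', 'A', 'B', 'B', 'C', 'C', 'D'],
})
print(df)
```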

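I would like to obtain another (aggregated?) data frame in which each row represents one subject and the columns form an exhaustive list of all one-hot encoded features: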

ID   FEATURE_A FEATURE_B FEATURE_C FEATURE_D
--------------------------------------------
1    1         1         1         0
2    1         0         1         0
3    0         1         0         1

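How can this be done in Python (pandas)?

Bonus: how can I implement a version where the feature columns contain the number of occurrences rather than just binary flags?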

3 answers:

Answer 0 (score: 3)

Use join with get_dummies, then groupby and aggregate with max:

df1 = df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').max()
print (df1)
    FEATURE_A  FEATURE_B  FEATURE_C  FEATURE_D
ID                                            
1           1          1          1          0
2           1          0          1          0
3           0          1          0          1

Details:

print (pd.get_dummies(df['Feature']))
   A  B  C  D
0  1  0  0  0
1  0  1  0  0
2  1  0  0  0
3  1  0  0  0
4  0  1  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  1  0
8  0  0  0  1
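
On recent pandas versions, get_dummies returns boolean columns by default, so the result above may print True/False instead of 1/0. The following is a minimal sketch of the same join/groupby/max idea that requests integers through the `dtype` argument and sets the prefix with the `prefix` argument. The names `dummies` and `one_hot` are illustrative, not taken from the answer.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    'ID': [1, 1, 2, 1, 3, 3, 1, 2, 3],
    'Feature': ['A', 'B', 'A', 'A', 'B', 'B', 'C', 'C', 'D'],
})

# One indicator column per feature; dtype=int keeps 0/1 instead of booleans.
dummies = pd.get_dummies(df['Feature'], prefix='FEATURE', dtype=int)

# Attach the IDs, then collapse to one row per ID. max() yields 1 when the
# feature occurs at least once for that ID.
one_hot = df[['ID']].join(dummies).groupby('ID').max()
print(one_hot)
```

Replacing max() with sum() in the last step would give occurrence counts rather than flags.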

Another solution, using MultiLabelBinarizer and the DataFrame constructor:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['Feature']),
                   columns=['FEATURE_' + x for x in mlb.classes_], 
                   index=df.ID).max(level=0)
print (df1)
    FEATURE_A  FEATURE_B  FEATURE_C  FEATURE_D
ID                                            
1           1          1          1          0
2           1          0          1          0
3           0          1          0          1

Timings

np.random.seed(123)
N = 100000
L = list('abcdefghijklmno'.upper()) 
df = pd.DataFrame({'Feature': np.random.choice(L, N),
                   'ID':np.random.randint(10000,size=N)})

def jez(df):
    mlb = MultiLabelBinarizer()
    return pd.DataFrame(mlb.fit_transform(df['Feature']),
                   columns=['FEATURE_' + x for x in mlb.classes_], 
                   index=df.ID).max(level=0)


#jez1
In [464]: %timeit (df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').max())
10 loops, best of 3: 39.3 ms per loop

In [465]: %timeit (jez(df))
10 loops, best of 3: 138 ms per loop

#Scott Boston1
In [466]: %timeit (df.set_index('ID')['Feature'].str.get_dummies().add_prefix('FEATURE_').max(level=0))
1 loop, best of 3: 1.03 s per loop

#wen1
In [467]: %timeit (pd.crosstab(df.ID,df.Feature).gt(0).astype(int).add_prefix('FEATURE '))
1 loop, best of 3: 383 ms per loop

#wen2
In [468]: %timeit (pd.get_dummies(df.drop_duplicates().set_index('ID')).sum(level=0))
10 loops, best of 3: 47 ms per loop

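Caveat

These results do not show how performance depends on the ratio of distinct Feature values to distinct ID values, which strongly affects the timings of some solutions.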
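
One way to explore this caveat is to repeat the measurement while varying the number of distinct IDs and features. The sketch below is not the benchmark used above. `make_frame`, `one_hot_crosstab` and `one_hot_dummies` are hypothetical helpers, and the sizes are arbitrary. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows, n_ids, n_features, seed=123):
    """Hypothetical helper: random (ID, Feature) records of a chosen shape."""
    rng = np.random.default_rng(seed)
    labels = [f'F{i}' for i in range(n_features)]
    return pd.DataFrame({
        'ID': rng.integers(n_ids, size=n_rows),
        'Feature': rng.choice(labels, size=n_rows),
    })


def one_hot_crosstab(df):
    # Count occurrences per (ID, Feature), then reduce to 0/1 flags.
    return pd.crosstab(df['ID'], df['Feature']).gt(0).astype(int)


def one_hot_dummies(df):
    # Indicator columns per row, collapsed to one row per ID.
    dummies = pd.get_dummies(df['Feature'], dtype=int)
    return df[['ID']].join(dummies).groupby('ID').max()


# Time both approaches for a small and a large number of distinct values.
for n_ids, n_features in [(100, 5), (1000, 50)]:
    frame = make_frame(20_000, n_ids, n_features)
    for func in (one_hot_crosstab, one_hot_dummies):
        seconds = timeit.timeit(lambda: func(frame), number=3) / 3
        print(f'{func.__name__}: ids={n_ids} features={n_features} {seconds:.4f}s')
```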

Answer 1 (score: 2)

Another similar option is to use set_index, the .str (string accessor) get_dummies, and max with the level=0 parameter, together with add_prefix to change the column names:

df.set_index('ID')['Feature'].str.get_dummies().add_prefix('FEATURE_').max(level=0)
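
Newer pandas releases (2.0 and later, as far as I know) no longer accept the `level` argument on reductions such as max and sum. A sketch of the same chain written with groupby(level=0), which should behave the same way on older versions too, might look like this. The name `one_hot` is illustrative.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    'ID': [1, 1, 2, 1, 3, 3, 1, 2, 3],
    'Feature': ['A', 'B', 'A', 'A', 'B', 'B', 'C', 'C', 'D'],
})

one_hot = (
    df.set_index('ID')['Feature']   # Series of features indexed by ID
      .str.get_dummies()            # one 0/1 column per distinct feature
      .add_prefix('FEATURE_')       # rename A -> FEATURE_A, and so on
      .groupby(level=0)             # group rows that share the same ID
      .max()                        # 1 if the feature occurs at least once
)
print(one_hot)
```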

Answer 2 (score: 1)

Using pd.crosstab:

pd.crosstab(df.ID,df.Feature).gt(0).astype(int).add_prefix('FEATURE ')
Out[805]: 
Feature  FEATURE A  FEATURE B  FEATURE C  FEATURE D
ID                                                 
1                1          1          1          0
2                1          0          1          0
3                0          1          0          1

Or use drop_duplicates followed by get_dummies:

pd.get_dummies(df.drop_duplicates().set_index('ID')).sum(level=0)
Out[808]: 
    Feature_A  Feature_B  Feature_C  Feature_D
ID                                            
1           1          1          1          0
2           1          0          1          0
3           0          1          0          1

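Bonus answer: how to implement a version where the feature columns contain the number of occurrences rather than just binary flags?

Option 1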

pd.crosstab(df.ID,df.Feature)
Out[809]: 
Feature  A  B  C  D
ID                 
1        2  1  1  0
2        1  0  1  0
3        0  2  0  1

Option 2

pd.get_dummies(df.set_index('ID')).sum(level=0)
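
The two counting options can be compared side by side with a sketch like the following. It uses groupby(level=0).sum() in place of sum(level=0), on the assumption that a recent pandas version is installed. The names `counts`, `counts_alt` and `flags` are illustrative.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    'ID': [1, 1, 2, 1, 3, 3, 1, 2, 3],
    'Feature': ['A', 'B', 'A', 'A', 'B', 'B', 'C', 'C', 'D'],
})

# Option 1: crosstab counts how often each feature occurs for each ID.
counts = pd.crosstab(df['ID'], df['Feature']).add_prefix('FEATURE_')

# Option 2: indicator columns per record, summed within each ID.
counts_alt = (
    pd.get_dummies(df.set_index('ID')['Feature'], prefix='FEATURE', dtype=int)
      .groupby(level=0)
      .sum()
)

# Binary flags can be recovered from the counts when needed.
flags = counts.gt(0).astype(int)

print(counts)
print(flags)
```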