Question

I currently have the following DataFrame:

import pandas as pd
df= pd.DataFrame({"ID" : ['1','2','3','4','5'], 
                     "col2" : [['a', 'b', 'c'], 
                               ['c', 'd', 'e', 'f'], 
                               ['f', 'b', 'f'], 
                               ['a', 'c', 'b'], 
                               ['b', 'a', 'b']]})

print(df)
  ID          col2
0  1     [a, b, c]
1  2  [c, d, e, f]
2  3     [f, b, f]
3  4     [a, c, b]
4  5     [b, a, b]

I want to create a new DataFrame from col2 that looks like this:

    ID   a   b   c   d   e   f
0   1    1   1   1   0   0   0
1   2    0   0   1   1   1   1
2   3    0   1   0   0   0   1
3   4    1   1   1   0   0   0
4   5    1   1   0   0   0   0

The following code generates a different set of columns for the letters in the list column:

df2 = df.col2.str.get_dummies(sep=",")
pd.concat([df['ID'], df2], axis=1)

ID  a   b   b]  c   c]  d   d]  e   f]  [a [b  [c  [f
1   0   1   0   0   1   0   0   0   0   1   0   0   0
2   0   0   0   0   0   1   0   1   1   0   0   1   0
3   0   1   0   0   0   0   0   0   1   0   0   0   1
4   0   0   1   1   0   0   0   0   0   1   0   0   0
5   1   0   0   0   0   0   1   0   0   0   1   0   0

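The code above generates a separate column for each letter depending on its position in the list. Does anyone know why this happens? The pd.get_dummies option does not work either.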

Answer 1

str.get_dummies works on strings, so you can turn each list into a single string joined by a separator and call str.get_dummies on that string. For example,

df['col2'].str.join('@').str.get_dummies('@')
Out: 
   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  0  1  1  1  1
2  0  1  0  0  0  1
3  1  1  1  0  0  0
4  1  1  0  0  0  0

Here, @ is an arbitrary character that does not appear in any of the lists.
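If you are not sure the separator is absent from your data, the same idea can be wrapped with a quick check. This is only a minimal sketch: `list_dummies` is a hypothetical helper name, not a pandas function, and it assumes every list element is a string.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["1", "2", "3", "4", "5"],
    "col2": [["a", "b", "c"], ["c", "d", "e", "f"], ["f", "b", "f"],
             ["a", "c", "b"], ["b", "a", "b"]],
})

def list_dummies(lists, sep="@"):
    """Hypothetical helper: indicator columns for a Series of string lists."""
    # Fail early if the separator occurs inside any element,
    # because the joined string would then be split in the wrong places.
    if any(sep in item for items in lists for item in items):
        raise ValueError(f"separator {sep!r} appears in the data")
    # Join each list into one string, then split it back into dummy columns.
    return lists.str.join(sep).str.get_dummies(sep)

dummies = list_dummies(df["col2"])
print(dummies)
```

If the check fails, pick another separator via the `sep` argument.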

Then you can concatenate as usual:

pd.concat([df['ID'], df['col2'].str.join('@').str.get_dummies('@')], axis=1)
Out: 
  ID  a  b  c  d  e  f
0  1  1  1  1  0  0  0
1  2  0  0  1  1  1  1
2  3  0  1  0  0  0  1
3  4  1  1  1  0  0  0
4  5  1  1  0  0  0  0
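As for why the attempt in the question produced columns such as `[a` and `b]`: a likely explanation, assuming col2 actually held the string form of each list (the question does not say), is that splitting on "," leaves the brackets attached to the first and last items. A small sketch with a made-up string Series `as_text`:

```python
import pandas as pd

# Assumption: the column holds text like "[a,b,c]" rather than real lists.
as_text = pd.Series(["[a,b,c]", "[f,b,f]"])

# Splitting on "," keeps "[" on the first item and "]" on the last one,
# so the same letter yields different columns depending on its position.
broken = as_text.str.get_dummies(sep=",")
print(broken.columns.tolist())
```

Here the letter f ends up in two separate columns, `[f` and `f]`, which matches the position-dependent columns shown in the question.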

Answer 2

Using a dict comprehension may be faster:

In [40]: pd.DataFrame({k: 1 for k in x} for x in df.col2.values).fillna(0).astype(int)
Out[40]:
   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  0  1  1  1  1
2  0  1  0  0  0  1
3  1  1  1  0  0  0
4  1  1  0  0  0  0    

In [48]: pd.concat([
                df['ID'], 
                pd.DataFrame({k: 1 for k in x} for x in df.col2).fillna(0).astype(int)],
            axis=1)
Out[48]:
  ID  a  b  c  d  e  f
0  1  1  1  1  0  0  0
1  2  0  0  1  1  1  1
2  3  0  1  0  0  0  1
3  4  1  1  1  0  0  0
4  5  1  1  0  0  0  0

Timings

In [2942]: df.shape
Out[2942]: (50000, 2)

In [2945]: %timeit pd.DataFrame({k: 1 for k in x} for x in df.col2).fillna(0).astype(int)
10 loops, best of 3: 137 ms per loop

In [2946]: %timeit df['col2'].str.join('@').str.get_dummies('@')
1 loop, best of 3: 395 ms per loop
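To repeat the comparison outside IPython, something like the following could be used. It is a rough sketch: the contents of the 50,000-row frame are not shown above, so `big` is an assumed stand-in built by repeating the five example rows, and `with_dicts` / `with_join` are just illustrative wrapper names. Absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

base = [["a", "b", "c"], ["c", "d", "e", "f"], ["f", "b", "f"],
        ["a", "c", "b"], ["b", "a", "b"]]
# Assumed benchmark data: the five example rows repeated to 50,000 rows.
big = pd.DataFrame({"col2": base * 10_000})

def with_dicts(s):
    # One dict per row; missing letters become NaN and are filled with 0.
    return pd.DataFrame({k: 1 for k in x} for x in s).fillna(0).astype(int)

def with_join(s):
    # Join each list into a string, then let str.get_dummies split it.
    return s.str.join("@").str.get_dummies("@")

for fn in (with_dicts, with_join):
    # Best of three single runs, reported in milliseconds.
    seconds = min(timeit.repeat(lambda: fn(big["col2"]), number=1, repeat=3))
    print(f"{fn.__name__}: {seconds * 1000:.0f} ms")
```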

Answer 3

Using the df you provided, this works:

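def f1(x):
    # 1 if the letter appears in the list
    # (sorted list, since a set has no defined order for an index)
    return pd.Series(1, index=sorted(set(x)))

def f2(x):
    # count occurrences of each letter
    # (pd.value_counts is deprecated in pandas 2.1+)
    return pd.Series(x).value_counts()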

print(df.set_index('ID').col2.apply(f1).fillna(0).astype(int).reset_index())
print('')
print(df.set_index('ID').col2.apply(f2).fillna(0).astype(int).reset_index())

  ID  a  b  c  d  e  f
0  1  1  1  1  0  0  0
1  2  0  0  1  1  1  1
2  3  0  1  0  0  0  1
3  4  1  1  1  0  0  0
4  5  1  1  0  0  0  0

  ID  a  b  c  d  e  f
0  1  1  1  1  0  0  0
1  2  0  0  1  1  1  1
2  3  0  1  0  0  0  2
3  4  1  1  1  0  0  0
4  5  1  2  0  0  0  0
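The two tables differ only where a letter repeats within a list, so the presence table can also be derived from the counts by capping them at 1. A minimal sketch of that relationship, swapping in `collections.Counter` for `value_counts` (my substitution, not part of the answer above):

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({
    "ID": ["1", "2", "3", "4", "5"],
    "col2": [["a", "b", "c"], ["c", "d", "e", "f"], ["f", "b", "f"],
             ["a", "c", "b"], ["b", "a", "b"]],
})

# One Counter per row gives the number of occurrences of each letter.
counts = (
    pd.DataFrame([Counter(x) for x in df["col2"]], index=df["ID"])
    .fillna(0)
    .astype(int)
    .sort_index(axis=1)
)

# Capping the counts at 1 turns them into presence indicators.
presence = counts.clip(upper=1)

print(presence.reset_index())
print(counts.reset_index())
```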
