Create and populate a data frame using the unique values of another data frame

Time: 2016-02-16 13:50:36

Tags: python performance pandas dataframe

I have a data frame df like this:

  X1   X2   X3
0  a    c    a
1  b    e    c
2  c  nan    e
3  d  nan  nan

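I want to create a new data frame newdf with one column (uentries) that contains the unique entries of df, plus the three columns of df filled with 0 or 1, depending on whether the entry in uentries appears in the corresponding column of df.

My desired output therefore looks like this (uentries does not need to be ordered):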

  uentries  X1  X2  X3
0        a   1   0   1
1        b   1   0   0
2        c   1   1   1
3        d   1   0   0
4        e   0   1   1

Currently, I do it like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})

uniqueEntries = set([x for x in np.ravel(df.values) if str(x) != 'nan'])

newdf = pd.DataFrame()
newdf['uentries'] = list(uniqueEntries)

for coli in df.columns:
    newdf[coli] = newdf['uentries'].isin(df[coli])

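# Convert the boolean membership flags to 0/1 integers
# (.ix has been removed from pandas, so the columns are selected by label).
newdf[['X1', 'X2', 'X3']] = newdf[['X1', 'X2', 'X3']].astype(int)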

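This gives me the desired output.

Is it possible to populate newdf more efficiently?

2 answers:

Answer 0: (score: 1)

You can use get_dummies with sum and concat, and finally fillna: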

import pandas as pd

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})
print(df)
  X1   X2   X3
0  a    c    a
1  b    e    c
2  c  nan    e
3  d  nan  nan

a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()

print(pd.concat([a, b, c], axis=1, keys=['X1', 'X2', 'X3']).fillna(0))
     X1  X2  X3
a     1   0   1
b     1   0   0
c     1   1   1
d     1   0   0
e     0   1   1
nan   0   2   1
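
The nan row shows up because the test data stores the literal string 'nan', which get_dummies treats like any other category. A minimal sketch, assuming those strings are meant to stand for missing values, converts them with DataFrame.replace before counting (the name cleaned is only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})

# Assumption: the string 'nan' is a placeholder for a missing value.
cleaned = df.replace('nan', np.nan)

# get_dummies ignores real NaN by default, so no 'nan' category is created.
b = pd.get_dummies(cleaned['X2']).sum()
print(b)
```

After this step the per-column sums contain only the real entries, so no extra row has to be dropped later.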

If you use np.nan in the test data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', np.nan, np.nan],
                   'X3': ['a', 'c', 'e', np.nan]})
print(df)

a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()

print(pd.concat([a, b, c], axis=1, keys=['X1', 'X2', 'X3']).fillna(0))
   X1  X2  X3
a   1   0   1
b   1   0   0
c   1   1   1
d   1   0   0
e   0   1   1
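
The three per-column calls can also be written as one dictionary comprehension, which scales to any number of columns. This is only a sketch: the `> 0` comparison and the rename_axis/reset_index step are additions that are not part of the code above. They are included to turn the counts into 0/1 flags and to produce the uentries column the question asks for.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', np.nan, np.nan],
                   'X3': ['a', 'c', 'e', np.nan]})

# One get_dummies(...).sum() per column; the dict keys become the column names.
counts = pd.concat({col: pd.get_dummies(df[col]).sum() for col in df.columns},
                   axis=1)

# Entries absent from a column are NaN after the concat: fill them with 0,
# then reduce the counts to 0/1 presence flags.
newdf = (counts.fillna(0) > 0).astype(int)

# Move the index (the unique entries) into a regular 'uentries' column.
newdf = newdf.rename_axis('uentries').reset_index()
print(newdf)
```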

Answer 1: (score: 1)

Here is a simple way to solve this using pd.value_counts.

newdf = df.apply(pd.value_counts).fillna(0)
newdf['uentries'] = newdf.index
newdf = newdf[['uentries', 'X1','X2','X3']]
newdf

    uentries  X1  X2  X3
a          a   1   0   1
b          b   1   0   0
c          c   1   1   1
d          d   1   0   0
e          e   0   1   1
nan      nan   0   2   1

You can then drop the row that contains the nan values:

newdf.drop(['nan'])

  uentries  X1  X2  X3
a        a   1   0   1
b        b   1   0   0
c        c   1   1   1
d        d   1   0   0
e        e   0   1   1
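
If the missing values are real np.nan rather than the string 'nan', the drop step is not needed, because value_counts skips NaN by default. The variant below is a sketch under that assumption. It uses pd.Series.value_counts, and the clip(upper=1) call is an addition so that entries repeated within a column would still be reported as 1 rather than as a count.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', np.nan, np.nan],
                   'X3': ['a', 'c', 'e', np.nan]})

# Count each value per column; NaN is excluded by default.
counts = df.apply(pd.Series.value_counts)

# Absent entries are NaN: fill with 0, cap counts at 1, store as integers.
newdf = counts.fillna(0).clip(upper=1).astype(int)

# Expose the unique entries as the first column and use a plain integer index.
newdf.insert(0, 'uentries', newdf.index)
newdf = newdf.reset_index(drop=True)
print(newdf)
```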