I have a DataFrame df that looks like this:
  X1   X2   X3
0  a    c    a
1  b    e    c
2  c  nan    e
3  d  nan  nan
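I want to create a new DataFrame newdf with one column (uentries) that holds the unique entries of df, plus the three columns of df filled with 0 and 1, depending on whether the entry in uentries occurs in the corresponding column of df.
My desired output therefore looks like this (uentries does not need to be ordered):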
  uentries  X1  X2  X3
0        a   1   0   1
1        b   1   0   0
2        c   1   1   1
3        d   1   0   0
4        e   0   1   1
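Currently, I do this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})
# Collect every distinct entry, skipping the 'nan' placeholders
uniqueEntries = set([x for x in np.ravel(df.values) if str(x) != 'nan'])
newdf = pd.DataFrame()
newdf['uentries'] = list(uniqueEntries)
# For each column, flag whether each unique entry occurs in it
for coli in df.columns:
    newdf[coli] = newdf['uentries'].isin(df[coli])
# Convert the boolean flags to 0/1 (.ix has been removed from pandas, so select by column labels)
newdf[list(df.columns)] = newdf[list(df.columns)].astype(int)
This gives me the desired output.
Is there a more efficient way to populate newdf?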
Answer 0 (score: 1)
You can use get_dummies with sum, then concat, and finally fillna:
import pandas as pd

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})
print(df)
  X1   X2   X3
0  a    c    a
1  b    e    c
2  c  nan    e
3  d  nan  nan
# Count how often each value occurs in each column
a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()
# Values absent from a column come out of concat as NaN; fill them with 0 and cast back to int
print(pd.concat([a, b, c], axis=1, keys=['X1', 'X2', 'X3']).fillna(0).astype(int))
     X1  X2  X3
a     1   0   1
b     1   0   0
c     1   1   1
d     1   0   0
e     0   1   1
nan   0   2   1
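The extra nan row in the output above appears because 'nan' in the test data is an ordinary string, which pandas does not treat as missing. As a minimal sketch (the name cleaned is only illustrative), the placeholders could be converted to real missing values with replace:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})

# The string 'nan' is a regular value, so nothing is reported as missing.
print(df['X2'].isna().sum())       # 0

# Swap the string placeholder for a real NaN (illustrative variable name).
cleaned = df.replace('nan', np.nan)
print(cleaned['X2'].isna().sum())  # 2
```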
If you use np.nan in the test data instead:
import pandas as pd
import numpy as np

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', np.nan, np.nan],
                   'X3': ['a', 'c', 'e', np.nan]})
print(df)
# get_dummies ignores real NaN values by default, so no 'nan' row is produced
a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()
print(pd.concat([a, b, c], axis=1, keys=['X1', 'X2', 'X3']).fillna(0).astype(int))
   X1  X2  X3
a   1   0   1
b   1   0   0
c   1   1   1
d   1   0   0
e   0   1   1
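The three per-column calls can also be written once for all columns. The following is a rough sketch rather than part of the answer itself: it assumes the data uses real NaN values, the names counts and indicators are illustrative, and the clip step is an addition so that a value repeated within a column still yields 1 instead of its count.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', np.nan, np.nan],
                   'X3': ['a', 'c', 'e', np.nan]})

# One get_dummies(...).sum() per column; the dict keys become the column names.
counts = pd.concat({col: pd.get_dummies(df[col]).sum() for col in df.columns},
                   axis=1)

# Values absent from a column are NaN after concat: fill with 0,
# cap counts at 1 (presence flag), and cast to int.
indicators = counts.fillna(0).clip(upper=1).astype(int)

# Move the index into a regular 'uentries' column, as in the question.
newdf = indicators.rename_axis('uentries').reset_index()
print(newdf)
```

This avoids naming each column by hand, so it should carry over to frames with more columns.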
Answer 1 (score: 1)
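Here is a simple way to solve this using value_counts.
# Count the occurrences of every value per column; values absent from a column become NaN
newdf = df.apply(pd.Series.value_counts).fillna(0).astype(int)
newdf['uentries'] = newdf.index
newdf = newdf[['uentries', 'X1', 'X2', 'X3']]
newdf
    uentries  X1  X2  X3
a          a   1   0   1
b          b   1   0   0
c          c   1   1   1
d          d   1   0   0
e          e   0   1   1
nan      nan   0   2   1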
You can then drop the row that holds the nan values:
newdf.drop(['nan'])
  uentries  X1  X2  X3
a        a   1   0   1
b        b   1   0   0
c        c   1   1   1
d        d   1   0   0
e        e   0   1   1
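Note that value_counts returns counts, so a value that appears more than once in a column would give a number greater than 1 rather than a 0/1 flag. A possible variation, sketched here under the assumption that the data uses real NaN values (the names counts and flags are illustrative), checks only whether a count exists:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', np.nan, np.nan],
                   'X3': ['a', 'c', 'e', np.nan]})

# value_counts skips NaN by default, so no 'nan' row has to be dropped later.
counts = df.apply(lambda col: col.value_counts())

# A count is present only where the value occurred at least once.
flags = counts.notna().astype(int)

newdf = flags.rename_axis('uentries').reset_index()
print(newdf)
```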