Remove duplicates and append values in Pandas

Time: 2017-06-06 18:23:15

Tags: python pandas dataframe

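I have a dataframe like the one below. I want to remove duplicates, but append the values in column E from the duplicated rows to the row that is kept.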

import pandas as pd
import numpy as np
dfp = pd.DataFrame({'A' : [np.nan,np.nan,3,4,5,5,3,1,6,7],
                    'B' : [1,1,3,5,0,0,np.nan,9,0,0],
                    'C' : ['AA1233445','AA1233445', 'rmacy','Idaho Rx','Ab123455','TV192837','RX','Ohio Drugs','RX12345','USA Pharma'],
                    'D' : [123456,123456,1234567,12345678,12345,12345,12345678,123456789,1234567,np.nan],
                    'E' : ['Assign','Allign','Hello','Ugly','Appreciate','Undo','Testing','Unicycle','Pharma','Unicorn']})
print(dfp)

I grabbed all the duplicates:

df2 = dfp.loc[(dfp['A'].duplicated(keep=False))].copy()

     A    B          C           D           E
0  NaN  1.0  AA1233445    123456.0      Assign
1  NaN  1.0  AA1233445    123456.0      Allign
2  3.0  3.0      rmacy   1234567.0       Hello
4  5.0  0.0   Ab123455     12345.0  Appreciate
5  5.0  0.0   TV192837     12345.0        Undo
6  3.0  NaN         RX  12345678.0     Testing

And I would like my result to be:

     A    B          C          D                E
0  NaN  1.0  AA1233445   123456.0    Assign Allign
2  3.0  3.0      rmacy  1234567.0    Hello Testing
4  5.0  0.0   Ab123455    12345.0  Appreciate Undo

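I know I need to use dfp.loc[(dfp['A'].duplicated(keep='last'))].copy() to grab the first occurrence of each duplicate, but I have not managed to set the value of column E so that it includes the other duplicated values.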

I think I need to try something like:

df3 = dfp.loc[(dfp['A'].duplicated(keep='last'))].copy()
df3['E'] = df3['E'] + dfp.loc[(dfp['A'].duplicated(keep=False).copy()),'E']

But my output is:

     A    B          C          D                     E
0  NaN  1.0  AA1233445   123456.0          AssignAssign
2  3.0  3.0      rmacy  1234567.0            HelloHello
4  5.0  0.0   Ab123455    12345.0  AppreciateAppreciate

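I'm stumped. Am I overcomplicating this? How can I get the output I'm looking for, so that I can later drop all duplicates except the first, but 'save' the column E values of the dropped rows?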
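For what it's worth, the doubled strings appear to come from index alignment: `+` on two Series matches rows by label, so each kept row is paired with itself rather than with its duplicate. A minimal sketch with two hand-built Series (the names `left`, `right`, and `summed` are illustrative and not part of the code above):

```python
import pandas as pd

# Stand-in for df3['E']: only the first occurrence of each duplicated key.
left = pd.Series(['Assign', 'Hello'], index=[0, 2])

# Stand-in for the E values of every duplicated row.
right = pd.Series(['Assign', 'Allign', 'Hello', 'Testing'], index=[0, 1, 2, 6])

# Addition aligns on index labels: labels 0 and 2 exist on both sides,
# so each string is concatenated with its own copy. Labels 1 and 6 exist
# only on the right, so they become missing values.
summed = left + right
print(summed)
```

Assigning `summed` back into a frame whose index is only the kept rows then discards the unmatched labels, which would leave just the self-concatenated strings.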

2 Answers:

Answer 0: (score: 3)

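Define the functions to use in agg, and use them within a groupby. To make groupby keep the NaN group, I convert A to string before grouping and back to float afterwards.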

f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}

dfp.groupby(
    dfp.A.astype(str), sort=False
).agg(f).reset_index().eval(
    'A = @pd.to_numeric(A, "coerce").values',
    inplace=False
)

     A    B           C            D                E
0  NaN  1.0   AA1233445     123456.0    Assign Allign
1  3.0  3.0       rmacy    1234567.0    Hello Testing
2  4.0  5.0    Idaho Rx   12345678.0             Ugly
3  5.0  0.0    Ab123455      12345.0  Appreciate Undo
4  1.0  9.0  Ohio Drugs  123456789.0         Unicycle
5  6.0  0.0     RX12345    1234567.0           Pharma
6  7.0  0.0  USA Pharma          NaN          Unicorn
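On newer pandas versions the string round-trip may be unnecessary. The sketch below assumes pandas 1.1 or later, where `groupby` accepts `dropna=False` so that NaN keys form their own group; the names `agg_funcs` and `out` are illustrative. The position of the NaN group in the result can differ between pandas versions, so it is safer not to rely on row order.

```python
import numpy as np
import pandas as pd

dfp = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 6, 7],
    'B': [1, 1, 3, 5, 0, 0, np.nan, 9, 0, 0],
    'C': ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455',
          'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
    'D': [123456, 123456, 1234567, 12345678, 12345, 12345,
          12345678, 123456789, 1234567, np.nan],
    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate', 'Undo',
          'Testing', 'Unicycle', 'Pharma', 'Unicorn'],
})

# Join the E strings within each group; keep the first non-null value elsewhere.
agg_funcs = {'B': 'first', 'C': 'first', 'D': 'first', 'E': ' '.join}

# dropna=False keeps rows whose key is NaN (assumes pandas >= 1.1).
out = dfp.groupby('A', sort=False, dropna=False).agg(agg_funcs).reset_index()
print(out)
```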

To limit it to just the duplicated rows:

f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}
d1 = dfp[dfp.duplicated('A', keep=False)]
d2 = d1.groupby(d1.A.astype(str), sort=False).agg(f).reset_index()
d2.A = d2.A.astype(float)

d2

     A    B          C          D                E
0  NaN  1.0  AA1233445   123456.0    Assign Allign
1  3.0  3.0      rmacy  1234567.0    Hello Testing
2  5.0  0.0   Ab123455    12345.0  Appreciate Undo
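If the original row labels (0, 2, 4) should survive, one possible variation is to broadcast the joined string back onto every row with `transform` and only then drop the later duplicates. This is a sketch under the same string-key assumption as above, not a definitive implementation; the names `key`, `joined`, `is_dup`, and `result` are illustrative.

```python
import numpy as np
import pandas as pd

dfp = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 6, 7],
    'B': [1, 1, 3, 5, 0, 0, np.nan, 9, 0, 0],
    'C': ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455',
          'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
    'D': [123456, 123456, 1234567, 12345678, 12345, 12345,
          12345678, 123456789, 1234567, np.nan],
    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate', 'Undo',
          'Testing', 'Unicycle', 'Pharma', 'Unicorn'],
})

# String keys so that the NaN rows end up in one group ('nan').
key = dfp['A'].astype(str)

# For every row, the space-joined E values of its whole group.
joined = dfp.groupby(key)['E'].transform(' '.join)

# Rows whose A value occurs more than once (NaN counts as equal to NaN here).
is_dup = dfp.duplicated('A', keep=False)

# Overwrite E, keep only duplicated rows, then keep the first row per A value.
result = dfp.assign(E=joined)[is_dup].drop_duplicates('A')
print(result)
```

Because nothing is re-indexed, the surviving rows keep their labels from `dfp`, which can help if the result needs to be matched back against the original frame.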

Answer 1: (score: 3)

Here is my ugly solution:

In [263]: (dfp.reset_index()
     ...:     .assign(A=dfp.A.fillna(-1))
     ...:     .groupby('A')
     ...:     .filter(lambda x: len(x) > 1)
     ...:     .groupby('A', as_index=False)
     ...:     .apply(lambda x: x.head(1).assign(E=x.E.str.cat(sep=' ')))
     ...:     .replace({'A':{-1:np.nan}})
     ...:     .set_index('index'))
     ...:
Out[263]:
         A    B          C          D                E
index
0      NaN  1.0  AA1233445   123456.0    Assign Allign
2      3.0  3.0      rmacy  1234567.0    Hello Testing
4      5.0  0.0   Ab123455    12345.0  Appreciate Undo