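I have a DataFrame like the one below. I want to drop the duplicates, but append the column E values from the dropped rows to the record that is kept.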
import pandas as pd
import numpy as np
# np.nan is used instead of np.NaN: the NaN alias was removed in NumPy 2.0
dfp = pd.DataFrame({'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 6, 7],
                    'B': [1, 1, 3, 5, 0, 0, np.nan, 9, 0, 0],
                    'C': ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455', 'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
                    'D': [123456, 123456, 1234567, 12345678, 12345, 12345, 12345678, 123456789, 1234567, np.nan],
                    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate', 'Undo', 'Testing', 'Unicycle', 'Pharma', 'Unicorn']})
print(dfp)
I select all the duplicates:
df2 = dfp.loc[(dfp['A'].duplicated(keep=False))].copy()
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign
1 NaN 1.0 AA1233445 123456.0 Allign
2 3.0 3.0 rmacy 1234567.0 Hello
4 5.0 0.0 Ab123455 12345.0 Appreciate
5 5.0 0.0 TV192837 12345.0 Undo
6 3.0 NaN RX 12345678.0 Testing
and I would like my result to be:
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign Allign
2 3.0 3.0 rmacy 1234567.0 Hello Testing
4 5.0 0.0 Ab123455 12345.0 Appreciate Undo
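I know I need to use dfp.loc[(dfp['A'].duplicated(keep='last'))].copy() to grab the first occurrence of each duplicate, but I have not managed to set the value of column E so that it includes the values from the other duplicate rows.
I think I need to try something like: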
df3 = dfp.loc[(dfp['A'].duplicated(keep='last'))].copy()
df3['E'] = df3['E'] + dfp.loc[(dfp['A'].duplicated(keep=False).copy()),'E']
But my output is:
A B C D E
0 NaN 1.0 AA1233445 123456.0 AssignAssign
2 3.0 3.0 rmacy 1234567.0 HelloHello
4 5.0 0.0 Ab123455 12345.0 AppreciateAppreciate
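I'm stumped. Am I overcomplicating this? How can I get the output I'm looking for, so that I can later drop all duplicates except the first while still 'saving' the column E values from the dropped rows?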
Answer 0 (score: 3)
Define the functions to use in agg and apply them within a groupby. To make groupby work with the NaN keys, I convert A to string and then back to float.
f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}
dfp.groupby(
dfp.A.astype(str), sort=False
).agg(f).reset_index().eval(
'A = @pd.to_numeric(A, "coerce").values',
inplace=False
)
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign Allign
1 3.0 3.0 rmacy 1234567.0 Hello Testing
2 4.0 5.0 Idaho Rx 12345678.0 Ugly
3 5.0 0.0 Ab123455 12345.0 Appreciate Undo
4 1.0 9.0 Ohio Drugs 123456789.0 Unicycle
5 6.0 0.0 RX12345 1234567.0 Pharma
6 7.0 0.0 USA Pharma NaN Unicorn
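On pandas 1.1 or newer, the string round-trip can probably be skipped, because groupby accepts dropna=False and then keeps NaN keys as a group of their own. A minimal sketch under that version assumption (the names agg_funcs and out are only illustrative, not part of the answer above):

```python
import numpy as np
import pandas as pd

dfp = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 6, 7],
    'B': [1, 1, 3, 5, 0, 0, np.nan, 9, 0, 0],
    'C': ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455',
          'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
    'D': [123456, 123456, 1234567, 12345678, 12345, 12345,
          12345678, 123456789, 1234567, np.nan],
    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate', 'Undo',
          'Testing', 'Unicycle', 'Pharma', 'Unicorn'],
})

# Join the E values of each group; keep the first value of every other column.
agg_funcs = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}

# Assumption: pandas >= 1.1, where dropna=False keeps NaN keys as a group.
out = dfp.groupby('A', sort=False, dropna=False).agg(agg_funcs).reset_index()
print(out)
```

Note that the 'first' aggregation returns the first non-null value per column, so in other data a kept row may combine values that originally came from different rows.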
To restrict it to only the duplicated rows:
f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}
d1 = dfp[dfp.duplicated('A', keep=False)]
d2 = d1.groupby(d1.A.astype(str), sort=False).agg(f).reset_index()
d2.A = d2.A.astype(float)
d2
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign Allign
1 3.0 3.0 rmacy 1234567.0 Hello Testing
2 5.0 0.0 Ab123455 12345.0 Appreciate Undo
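As a side note on why the attempt in the question produced AssignAssign: adding two Series aligns them on their index labels first, so each kept row is concatenated with the E value that carries the same label, which is its own. A small sketch with hand-built Series (left, right and aligned are illustrative names, and the values are a subset of the question's data):

```python
import pandas as pd

# E values of two first occurrences (labels 0 and 2) ...
left = pd.Series(['Assign', 'Hello'], index=[0, 2])
# ... and E values of the corresponding duplicated rows (labels 0, 1, 2, 6).
right = pd.Series(['Assign', 'Allign', 'Hello', 'Testing'], index=[0, 1, 2, 6])

# Addition pairs values by index label, not by duplicate group:
# label 0 meets label 0, label 2 meets label 2; labels 1 and 6 have no partner.
aligned = left + right
print(aligned)
```

Labels that exist on only one side come out as missing values, which is why the words from the other duplicate rows never reach the kept rows.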
Answer 1 (score: 3)
Here is my ugly solution:
In [263]: (dfp.reset_index()
...: .assign(A=dfp.A.fillna(-1))
...: .groupby('A')
...: .filter(lambda x: len(x) > 1)
...: .groupby('A', as_index=False)
...: .apply(lambda x: x.head(1).assign(E=x.E.str.cat(sep=' ')))
...: .replace({'A':{-1:np.nan}})
...: .set_index('index'))
...:
Out[263]:
A B C D E
index
0 NaN 1.0 AA1233445 123456.0 Assign Allign
2 3.0 3.0 rmacy 1234567.0 Hello Testing
4 5.0 0.0 Ab123455 12345.0 Appreciate Undo
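If keeping the original index labels (0, 2, 4) matters, a variant without the -1 sentinel is to build the joined E strings with transform and then drop the later duplicates. This is only a sketch, not the answerer's method; key, joined, dupes and result are illustrative names, and it borrows the string-key trick from the other answer so that NaN forms its own group:

```python
import numpy as np
import pandas as pd

dfp = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 6, 7],
    'B': [1, 1, 3, 5, 0, 0, np.nan, 9, 0, 0],
    'C': ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455',
          'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
    'D': [123456, 123456, 1234567, 12345678, 12345, 12345,
          12345678, 123456789, 1234567, np.nan],
    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate', 'Undo',
          'Testing', 'Unicycle', 'Pharma', 'Unicorn'],
})

# String version of A, so that NaN becomes the ordinary key 'nan'.
key = dfp['A'].astype(str)

# For every row, the space-joined E values of its whole group (same index as dfp).
joined = dfp.groupby(key)['E'].transform(' '.join)

# Keep only rows whose A value occurs more than once, swap in the joined E,
# then keep the first row of each duplicate group.
dupes = dfp['A'].duplicated(keep=False)
result = dfp.loc[dupes].assign(E=joined).drop_duplicates('A')
print(result)
```

Because assign aligns joined on the index and drop_duplicates keeps the first occurrence, the surviving rows retain their original labels and their own B, C and D values.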