Conditional drop_duplicates() in Pandas

Asked: 2014-12-23 20:15:38

Tags: python pandas

I have the following dataset:

A,B,C,D
1,A_Task,WID,WI_DTL
1,A_adhoc_load,ATT,IXN_1
1,A_adhoc_load,ATT,IXN_10
1,A_adhoc_load,ATT,IXN_100
1,A_adhoc_load,ATT,IXN_101
2,Batch_Support,ATT,CDS_STATUS
2,Batch_Support,ATT,CDS_CONTROL
2,Batch_Support,ATT,CDS_ORA_STATUS
2,Batch_Support,ATT,REP_FILTER
1,online_load,ATT,TAX_3
1,online_load,ATT,TAX_4
1,online_load,ATT,TAX_8
1,online_load,ATT,TAX_11

The desired output is:

A,B,C,D
1,A_Task,WID,WI_DTL
1,A_adhoc_load,ATT,IXN
2,Batch_Support,ATT,CDS_STATUS
2,Batch_Support,ATT,CDS_CONTROL
2,Batch_Support,ATT,CDS_ORA_STATUS
2,Batch_Support,ATT,REP_FILTER
1,online_load,ATT,TAX

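That is, I want to collapse the rows whose value in column "D" has the form %_[0-9]+ (a prefix followed by an underscore and digits) into a single row per prefix.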

Here is what I have tried so far:

import pandas as pd


cs = pd.read_csv('inp.csv')

cs["NEW"] = cs.D.str.match(r'([A-Z]+)_[0-9]+')
print(cs)


    A                B    C               D      NEW
0   1       Adhoc_Task  WID          WI_DTL      []  
1   1  Arun_adhoc_load  ATT           IXN_1  (IXN,)
2   1  Arun_adhoc_load  ATT          IXN_10  (IXN,)
3   1  Arun_adhoc_load  ATT         IXN_100  (IXN,)
4   1  Arun_adhoc_load  ATT         IXN_101  (IXN,)
5   2    Batch_Support  ATT      CDS_STATUS      []
6   2    Batch_Support  ATT     CDS_CONTROL      []
7   2    Batch_Support  ATT  CDS_ORA_STATUS      []
8   2    Batch_Support  ATT      REP_FILTER      []
9   1      online_load  ATT           TAX_3  (TAX,)
10  1      online_load  ATT           TAX_4  (TAX,)
11  1      online_load  ATT           TAX_8  (TAX,)
12  1      online_load  ATT          TAX_11  (TAX,)


cs_new=cs[cs.NEW != []]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 572, in wrapper
    res = na_op(values, other)
  File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 533, in na_op
    result = lib.vec_compare(x, y, op)
  File "lib.pyx", line 671, in pandas.lib.vec_compare (pandas/lib.c:12404)
ValueError: Arrays were different lengths: 13 vs 0

cs_new=cs[cs.NEW == []]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 572, in wrapper
    res = na_op(values, other)
  File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 533, in na_op
    result = lib.vec_compare(x, y, op)
  File "lib.pyx", line 671, in pandas.lib.vec_compare (pandas/lib.c:12404)
ValueError: Arrays were different lengths: 13 vs 0


cs.drop_duplicates('NEW')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/anaconda/lib/python2.7/site-packages/pandas/util/decorators.py", line 60, in wrapper
    return func(*args, **kwargs)
  File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 2590, in drop_duplicates
    duplicated = self.duplicated(subset, take_last=take_last)
  File "/usr/anaconda/lib/python2.7/site-packages/pandas/util/decorators.py", line 60, in wrapper
    return func(*args, **kwargs)
  File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 2639, in duplicated
    duplicated = lib.duplicated(keys, take_last=take_last)
  File "lib.pyx", line 1202, in pandas.lib.duplicated (pandas/lib.c:20180)
TypeError: unhashable type: 'list'

My idea is:

1. Split the DataFrame on the value of "NEW": the first DF has NEW == [] and the second has NEW != [].
2. Remove duplicates from the second DF using column "NEW".
3. Append the two DFs.
4. Drop column "NEW" to obtain the final result.
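
The four steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the sample data is inlined instead of read from inp.csv, the prefix is pulled out with str.extract(expand=False) so that "NEW" holds plain strings (or NaN) rather than unhashable lists, and "D" is overwritten with the prefix in step 2 so the rows match the desired output.

```python
import io
import pandas as pd

# Inline copy of the sample data from the question (stands in for inp.csv).
csv_text = """A,B,C,D
1,A_Task,WID,WI_DTL
1,A_adhoc_load,ATT,IXN_1
1,A_adhoc_load,ATT,IXN_10
1,A_adhoc_load,ATT,IXN_100
1,A_adhoc_load,ATT,IXN_101
2,Batch_Support,ATT,CDS_STATUS
2,Batch_Support,ATT,CDS_CONTROL
2,Batch_Support,ATT,CDS_ORA_STATUS
2,Batch_Support,ATT,REP_FILTER
1,online_load,ATT,TAX_3
1,online_load,ATT,TAX_4
1,online_load,ATT,TAX_8
1,online_load,ATT,TAX_11
"""
cs = pd.read_csv(io.StringIO(csv_text))

# Assumption: use str.extract instead of str.match so NEW contains the
# captured prefix as a string, or NaN when D is not of the form PREFIX_digits.
cs["NEW"] = cs["D"].str.extract(r"^([A-Z]+)_[0-9]+$", expand=False)

# 1. Split into rows without a numeric suffix and rows with one.
no_match = cs[cs["NEW"].isna()]
matched = cs[cs["NEW"].notna()]

# 2. Keep the first row per prefix, and (an addition to the plan) show the
#    bare prefix in D so the rows look like the desired output.
matched = matched.drop_duplicates(subset="NEW").copy()
matched["D"] = matched["NEW"]

# 3. Put the two parts back together in the original row order.
result = pd.concat([no_match, matched]).sort_index()

# 4. Drop the helper column.
result = result.drop(columns="NEW").reset_index(drop=True)
print(result)
```

Because NaN and strings are hashable, neither drop_duplicates() nor isna()/notna() raises the errors shown above.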

I also tried the following:

cs['NEW'].drop_duplicates().values.tolist()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 1128, in drop_duplicates
    duplicated = self.duplicated(take_last=take_last)
  File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 1149, in duplicated
    duplicated = lib.duplicated(keys, take_last=take_last)
  File "lib.pyx", line 1202, in pandas.lib.duplicated (pandas/lib.c:20180)
TypeError: unhashable type: 'list'

list(set(cs['NEW']))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

Please help.

1 Answer:

Answer 0 (score: 0)

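You can use str.replace to strip the trailing numeric part from the strings in D (on pandas 2.0 and later, regex=True must be passed explicitly, because the pattern is otherwise treated as a literal string):

In [204]: df['D'] = df['D'].str.replace(r'_[0-9]+$', '', regex=True)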

In [205]: df
Out[205]: 
    A              B    C               D
0   1         A_Task  WID          WI_DTL
1   1   A_adhoc_load  ATT             IXN
2   1   A_adhoc_load  ATT             IXN
3   1   A_adhoc_load  ATT             IXN
4   1   A_adhoc_load  ATT             IXN
5   2  Batch_Support  ATT      CDS_STATUS
6   2  Batch_Support  ATT     CDS_CONTROL
7   2  Batch_Support  ATT  CDS_ORA_STATUS
8   2  Batch_Support  ATT      REP_FILTER
9   1    online_load  ATT             TAX
10  1    online_load  ATT             TAX
11  1    online_load  ATT             TAX
12  1    online_load  ATT             TAX
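
To check which values the pattern touches, here is a small sketch on a hand-picked subset of the D values (the Series name d is just for illustration). The `$` anchor means only an underscore-plus-digits run at the very end of the string is removed, so names such as WI_DTL or CDS_ORA_STATUS pass through unchanged.

```python
import pandas as pd

# A few D values from the question, chosen to cover both cases.
d = pd.Series(["WI_DTL", "IXN_1", "IXN_100", "CDS_ORA_STATUS", "TAX_11"])

# Strip a trailing "_<digits>" suffix; regex=True makes the pattern a regex.
stripped = d.str.replace(r"_[0-9]+$", "", regex=True)

print(stripped.tolist())
# ['WI_DTL', 'IXN', 'IXN', 'CDS_ORA_STATUS', 'TAX']
```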

Then use groupby/first to keep the first row for each value of D:

In [216]: df.groupby('D', as_index=False, sort=False).first()[list('ABCD')]
Out[216]: 
   A              B    C               D
0  1         A_Task  WID          WI_DTL
1  1   A_adhoc_load  ATT             IXN
2  2  Batch_Support  ATT      CDS_STATUS
3  2  Batch_Support  ATT     CDS_CONTROL
4  2  Batch_Support  ATT  CDS_ORA_STATUS
5  2  Batch_Support  ATT      REP_FILTER
6  1    online_load  ATT             TAX
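
Since the question asks about drop_duplicates(), the same result can likely be reached with it once D has been cleaned. The sketch below uses a shortened, hand-built version of the data, so the frame contents are an assumption for illustration. Called without arguments, drop_duplicates() compares all columns, so two rows are merged only if A, B, C and D all agree, whereas grouping on D alone would also merge rows that share D but differ in B.

```python
import pandas as pd

# Shortened version of the sample data, built inline for illustration.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 1, 1],
    "B": ["A_Task", "A_adhoc_load", "A_adhoc_load", "Batch_Support",
          "Batch_Support", "online_load", "online_load"],
    "C": ["WID", "ATT", "ATT", "ATT", "ATT", "ATT", "ATT"],
    "D": ["WI_DTL", "IXN_1", "IXN_10", "CDS_STATUS", "REP_FILTER",
          "TAX_3", "TAX_11"],
})

# Strip the trailing "_<digits>" suffix, as in the answer above.
df["D"] = df["D"].str.replace(r"_[0-9]+$", "", regex=True)

# Drop rows that are now identical in every column, keeping the first
# occurrence and the original row order.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```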