I have the following dataset:
A,B,C,D
1,A_Task,WID,WI_DTL
1,A_adhoc_load,ATT,IXN_1
1,A_adhoc_load,ATT,IXN_10
1,A_adhoc_load,ATT,IXN_100
1,A_adhoc_load,ATT,IXN_101
2,Batch_Support,ATT,CDS_STATUS
2,Batch_Support,ATT,CDS_CONTROL
2,Batch_Support,ATT,CDS_ORA_STATUS
2,Batch_Support,ATT,REP_FILTER
1,online_load,ATT,TAX_3
1,online_load,ATT,TAX_4
1,online_load,ATT,TAX_8
1,online_load,ATT,TAX_11
I want the output to look like this:
A,B,C,D
1,A_Task,WID,WI_DTL
1,A_adhoc_load,ATT,IXN
2,Batch_Support,ATT,CDS_STATUS
2,Batch_Support,ATT,CDS_CONTROL
2,Batch_Support,ATT,CDS_ORA_STATUS
2,Batch_Support,ATT,REP_FILTER
1,online_load,ATT,TAX
That is, I want to remove the rows in "D" that are duplicates of the form %_[0-9]+. This is what I tried:
import pandas as pd
cs = pd.read_csv('inp.csv')
cs["NEW"] = cs.D.str.match('([A-Z]+)\_[0-9]+')
print cs
A B C D NEW
0 1 Adhoc_Task WID WI_DTL []
1 1 Arun_adhoc_load ATT IXN_1 (IXN,)
2 1 Arun_adhoc_load ATT IXN_10 (IXN,)
3 1 Arun_adhoc_load ATT IXN_100 (IXN,)
4 1 Arun_adhoc_load ATT IXN_101 (IXN,)
5 2 Batch_Support ATT CDS_STATUS []
6 2 Batch_Support ATT CDS_CONTROL []
7 2 Batch_Support ATT CDS_ORA_STATUS []
8 2 Batch_Support ATT REP_FILTER []
9 1 online_load ATT TAX_3 (TAX,)
10 1 online_load ATT TAX_4 (TAX,)
11 1 online_load ATT TAX_8 (TAX,)
12 1 online_load ATT TAX_11 (TAX,)
cs_new=cs[cs.NEW != []]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 572, in wrapper
res = na_op(values, other)
File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 533, in na_op
result = lib.vec_compare(x, y, op)
File "lib.pyx", line 671, in pandas.lib.vec_compare (pandas/lib.c:12404)
ValueError: Arrays were different lengths: 13 vs 0
cs_new=cs[cs.NEW == []]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 572, in wrapper
res = na_op(values, other)
File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 533, in na_op
result = lib.vec_compare(x, y, op)
File "lib.pyx", line 671, in pandas.lib.vec_compare (pandas/lib.c:12404)
ValueError: Arrays were different lengths: 13 vs 0
cs.drop_duplicates('NEW')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/anaconda/lib/python2.7/site-packages/pandas/util/decorators.py", line 60, in wrapper
return func(*args, **kwargs)
File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 2590, in drop_duplicates
duplicated = self.duplicated(subset, take_last=take_last)
File "/usr/anaconda/lib/python2.7/site-packages/pandas/util/decorators.py", line 60, in wrapper
return func(*args, **kwargs)
File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 2639, in duplicated
duplicated = lib.duplicated(keys, take_last=take_last)
File "lib.pyx", line 1202, in pandas.lib.duplicated (pandas/lib.c:20180)
TypeError: unhashable type: 'list'
My idea was:
1. Split the DataFrame on the value of NEW: the first DF has NEW == [] and the second DF has NEW != [].
2. Remove duplicates using column "NEW".
3. Append the two DFs.
4. Drop column "NEW" to obtain the final result.
I also tried the following:
cs['NEW'].drop_duplicates().values.tolist()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 1128, in drop_duplicates
duplicated = self.duplicated(take_last=take_last)
File "/usr/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 1149, in duplicated
duplicated = lib.duplicated(keys, take_last=take_last)
File "lib.pyx", line 1202, in pandas.lib.duplicated (pandas/lib.c:20180)
TypeError: unhashable type: 'list'
list(set(cs['NEW']))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
Please help.
Answer 0 (score: 0)
You can use str.replace to remove the trailing numeric part of the strings in D:
In [204]: df['D'] = df['D'].str.replace(r'_[0-9]+$', '')
In [205]: df
Out[205]:
A B C D
0 1 A_Task WID WI_DTL
1 1 A_adhoc_load ATT IXN
2 1 A_adhoc_load ATT IXN
3 1 A_adhoc_load ATT IXN
4 1 A_adhoc_load ATT IXN
5 2 Batch_Support ATT CDS_STATUS
6 2 Batch_Support ATT CDS_CONTROL
7 2 Batch_Support ATT CDS_ORA_STATUS
8 2 Batch_Support ATT REP_FILTER
9 1 online_load ATT TAX
10 1 online_load ATT TAX
11 1 online_load ATT TAX
12 1 online_load ATT TAX
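The session above was run on an older pandas, where str.replace treated the pattern as a regular expression by default. Recent pandas versions default to a literal match, so the pattern has to be marked as a regex explicitly. A minimal sketch, assuming a pandas version that accepts the `regex` keyword; the Series `d` is a hypothetical sample of values taken from column D:

```python
import pandas as pd

# Hypothetical sample: a few values of column D from the question.
d = pd.Series(["WI_DTL", "IXN_1", "IXN_10", "CDS_ORA_STATUS", "TAX_11"])

# regex=True asks pandas to interpret the pattern as a regular expression.
# The pattern removes an underscore followed by digits at the end of the string,
# so names such as CDS_ORA_STATUS are left untouched.
stripped = d.str.replace(r"_[0-9]+$", "", regex=True)
print(stripped.tolist())
# → ['WI_DTL', 'IXN', 'IXN', 'CDS_ORA_STATUS', 'TAX']
```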
Then use groupby/first:
In [216]: df.groupby('D', as_index=False, sort=False).first()[list('ABCD')]
Out[216]:
A B C D
0 1 A_Task WID WI_DTL
1 1 A_adhoc_load ATT IXN
2 2 Batch_Support ATT CDS_STATUS
3 2 Batch_Support ATT CDS_CONTROL
4 2 Batch_Support ATT CDS_ORA_STATUS
5 2 Batch_Support ATT REP_FILTER
6 1 online_load ATT TAX
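As an alternative to groupby/first, the two steps can also be combined with drop_duplicates. This is only a sketch under a few assumptions: the data is an abbreviated copy of the question's CSV, inlined through `io.StringIO` instead of read from inp.csv, and the pandas version accepts the `regex` keyword.

```python
import io
import pandas as pd

# Abbreviated copy of the question's data, inlined so the example is self-contained.
csv_text = """A,B,C,D
1,A_Task,WID,WI_DTL
1,A_adhoc_load,ATT,IXN_1
1,A_adhoc_load,ATT,IXN_10
2,Batch_Support,ATT,CDS_STATUS
2,Batch_Support,ATT,REP_FILTER
1,online_load,ATT,TAX_3
1,online_load,ATT,TAX_11
"""
df = pd.read_csv(io.StringIO(csv_text))

# Step 1: strip the trailing "_<digits>" suffix from column D.
df["D"] = df["D"].str.replace(r"_[0-9]+$", "", regex=True)

# Step 2: drop rows that are now identical across all columns,
# keeping the first occurrence and the original row order.
result = df.drop_duplicates().reset_index(drop=True)
print(result)
```

Unlike groupby('D').first(), drop_duplicates() compares every column, so two rows that share a value in D but differ in A, B, or C are both kept. If D alone should decide, drop_duplicates(subset='D') behaves like the groupby version for this data.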