Drop duplicate rows, but keep rows with a specific value in one column (pandas python)

Time: 2016-06-12 18:25:04

Tags: python pandas dataframe duplicates

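I would like to do the following:

If two rows have exactly the same values in 3 columns ("ID", "symbol", and "date") and have either "X" or "T" in one column ("message"), then remove both of these rows. However, if two rows have the same values in the same 3 columns but a value other than "X" or "T" in the "message" column, then leave them intact.

Here is an example of my data frame: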

df = pd.DataFrame({"ID":["AA-1", "AA-1", "C-0" ,"BB-2", "BB-2"], "symbol":["A","A","C","B","B"], "date":["06/24/2014","06/24/2014","06/20/2013","06/25/2014","06/25/2015"], "message": ["T","X","T","",""] })

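Note that the first two rows have the same values in the columns "ID", "symbol", and "date", and have "T" and "X" in the column "message". I would like to remove these two rows.

However, the last two rows have the same values in the columns "ID", "symbol", and "date", but a blank value (different from "X" or "T") in the column "message".

I am interested in applying this to a large dataset with several million rows. Everything I have tried so far consumes all of my memory.

Thank you, I appreciate any help.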

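2 Answers:

Answer 0: (score: 0)

I think you can use groupby with filter. The condition keeps a group unless it has exactly 2 rows and every value in message is T or X (checked with isin):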

import pandas as pd

df = pd.DataFrame({"ID":["AA-1", "AA-1", "C-0" ,"BB-2", "BB-2"],
                   "symbol":["A","A","C","B","B"],
                   "date":["06/24/2014","06/24/2014","06/20/2013","06/25/2015","06/25/2015"],
                   "message": ["T","X","T","",""] })
print (df) 
     ID        date message symbol
0  AA-1  06/24/2014       T      A
1  AA-1  06/24/2014       X      A
2   C-0  06/20/2013       T      C
3  BB-2  06/25/2015              B
4  BB-2  06/25/2015              B

# Keep a group unless it has exactly 2 rows and every message is 'T' or 'X'
df1 = df.groupby(['ID','date','symbol']).filter(
    lambda x: not (len(x) == 2 and x.message.isin(['T','X']).all()))
print (df1)
     ID        date message symbol
2   C-0  06/20/2013       T      C
3  BB-2  06/25/2015              B
4  BB-2  06/25/2015              B

See Filtration in the pandas groupby docs.
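The filter call above runs a Python lambda once per group, which may become slow when millions of rows form many small groups. The following is only a sketch of the same rule written with transform, so that the per-group size and the "all T or X" flag are computed as whole columns. The helper column name is_tx and the variable names are made up for illustration, and whether this actually uses less memory on your data is untested.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["AA-1", "AA-1", "C-0", "BB-2", "BB-2"],
    "symbol": ["A", "A", "C", "B", "B"],
    "date": ["06/24/2014", "06/24/2014", "06/20/2013", "06/25/2015", "06/25/2015"],
    "message": ["T", "X", "T", "", ""],
})

keys = ["ID", "date", "symbol"]

# Helper column (hypothetical name): True where message is 'T' or 'X'
flagged = df.assign(is_tx=df["message"].isin(["T", "X"]))
grouped = flagged.groupby(keys)["is_tx"]

# Per-row group statistics, aligned with the original index
group_size = grouped.transform("size")
all_tx = grouped.transform("all")

# Drop groups that have exactly 2 rows, all of them 'T' or 'X'
df1 = df[~((group_size == 2) & all_tx)]
print(df1)
```

On the sample frame this keeps the rows with index 2, 3 and 4, the same rows the filter version keeps.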

Edit from comment:

import pandas as pd

df = pd.DataFrame({"ID":["AA-1", "AA-1", "C-0", "C-0","BB-2", "BB-2"],
                   "symbol":["A","A","C","C", "B","B"],
                   "date":["06/24/2014","06/24/2014","06/20/2013","06/20/2013","06/25/2015","06/25/2015"], 
                   "message": ["T","X","X","X","",""] })
print (df) 
     ID        date message symbol
0  AA-1  06/24/2014       T      A
1  AA-1  06/24/2014       X      A
2   C-0  06/20/2013       X      C
3   C-0  06/20/2013       X      C
4  BB-2  06/25/2015              B
5  BB-2  06/25/2015              B

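If you need to remove every group whose values are all X or T (which also removes double X or double T), and the len of each group is always 2: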

df1 = df.groupby(['ID','date','symbol']).filter(lambda x: ~x.message.isin(['T','X']).all())
print (df1)
     ID        date message symbol
4  BB-2  06/25/2015              B
5  BB-2  06/25/2015              B

If you only need to remove groups whose values are T and X, you can first sort_values by message and then filter, checking whether the first value in each group is T and the second is X ('T' comes first and 'X' second because of the sorting):

df2 = (df.sort_values('message')
         .groupby(['ID','date','symbol'], sort=False)
         .filter(lambda x: ((x.message.iloc[0] != 'T') | (x.message.iloc[1] != 'X'))))
print (df2)
     ID        date message symbol
4  BB-2  06/25/2015              B
5  BB-2  06/25/2015              B
2   C-0  06/20/2013       X      C
3   C-0  06/20/2013       X      C

Answer 1: (score: 0)

This might work for you:

vals = ['X', 'T']
pd.concat([df[~df.message.isin(vals)], df[df.message.isin(vals)].loc[~df.duplicated(subset=['ID', 'date', 'symbol'], keep=False), :]])

     ID        date message symbol
3  BB-2  06/25/2014              B
4  BB-2  06/25/2015              B
2   C-0  06/20/2013       T      C
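The concat expression keeps every row that is not T/X, plus every T/X row that is not duplicated on the three key columns. As a sketch, the same selection can be written as one boolean mask, which also keeps the original row order. The names is_tx, is_dup and result are chosen here for illustration; the frame is the one from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["AA-1", "AA-1", "C-0", "BB-2", "BB-2"],
    "symbol": ["A", "A", "C", "B", "B"],
    "date": ["06/24/2014", "06/24/2014", "06/20/2013", "06/25/2014", "06/25/2015"],
    "message": ["T", "X", "T", "", ""],
})

vals = ["X", "T"]

# True where message is 'X' or 'T'
is_tx = df["message"].isin(vals)

# True for every row whose (ID, date, symbol) occurs more than once
is_dup = df.duplicated(subset=["ID", "date", "symbol"], keep=False)

# Drop only the rows that are both duplicated and flagged 'X'/'T'
result = df[~(is_tx & is_dup)]
print(result)
```

Like the concat version, this test is row-wise rather than group-wise: in a duplicated group that mixes T with a blank message, only the T row would be dropped.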

It is quite fast:

%%timeit
pd.concat([df[~df.message.isin(['X', 'T'])], df[df.message.isin(['X', 'T'])].loc[~df.duplicated(subset=['ID', 'date', 'symbol'], keep=False), :]])
100 loops, best of 3: 1.99 ms per loop

%%timeit
df.groupby(['ID','date','symbol']).filter(lambda x: ~x.message.isin(['T','X']).all())
100 loops, best of 3: 2.71 ms per loop
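These timings come from the five-row sample, so they say little about a frame with millions of rows. Below is a rough sketch for timing the isin plus duplicated test on synthetic data. The row count, the random key distribution and the variable names are all assumptions made for illustration, not properties of the asker's dataset.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000  # hypothetical size; raise it to approach the real data

# Synthetic frame with the same four columns as the question
big = pd.DataFrame({
    "ID": rng.integers(0, n // 2, size=n).astype(str),
    "symbol": rng.choice(list("ABC"), size=n),
    "date": rng.choice(["06/24/2014", "06/25/2015"], size=n),
    "message": rng.choice(["T", "X", ""], size=n),
})

start = time.perf_counter()
is_tx = big["message"].isin(["X", "T"])
is_dup = big.duplicated(subset=["ID", "date", "symbol"], keep=False)
result = big[~(is_tx & is_dup)]  # drop rows that are duplicated and 'X'/'T'
elapsed = time.perf_counter() - start

print(f"{n} rows -> {len(result)} rows in {elapsed:.3f} s")
print(f"input frame uses {big.memory_usage(deep=True).sum() / 1e6:.1f} MB")
```

The printed numbers depend on the machine and the pandas version, so none are quoted here.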

The alternative solution raises an index error.
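The remark does not say which alternative is meant. One plausible reading, and it is only an assumption, is the sort_values variant that reads x.message.iloc[1]: in the question's original frame the C-0 group has a single row, so position 1 is out of bounds. A minimal sketch of a guarded version follows; keep_group is a hypothetical helper name.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["AA-1", "AA-1", "C-0", "BB-2", "BB-2"],
    "symbol": ["A", "A", "C", "B", "B"],
    "date": ["06/24/2014", "06/24/2014", "06/20/2013", "06/25/2015", "06/25/2015"],
    "message": ["T", "X", "T", "", ""],
})


def keep_group(x):
    # A group with fewer than 2 rows cannot be a T/X pair, so keep it
    # without touching iloc[1].
    if len(x) < 2:
        return True
    # After sorting by message, 'T' precedes 'X' inside each group
    return bool(x.message.iloc[0] != "T" or x.message.iloc[1] != "X")


df2 = (df.sort_values("message")
         .groupby(["ID", "date", "symbol"], sort=False)
         .filter(keep_group))
print(df2)
```

With the guard in place, the single-row C-0 group is kept and only the AA-1 pair is removed.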