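I would like to do the following:
If two rows have exactly the same values in 3 columns ("ID", "symbol", and "date") and have either "X" or "T" in one column ("message"), then remove both of these rows. However, if two rows have the same values in the same 3 columns but a value other than "X" or "T" in the other column, then leave them intact.
Here is an example of my data frame: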
df = pd.DataFrame({"ID":["AA-1", "AA-1", "C-0" ,"BB-2", "BB-2"], "symbol":["A","A","C","B","B"], "date":["06/24/2014","06/24/2014","06/20/2013","06/25/2014","06/25/2015"], "message": ["T","X","T","",""] })
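Note that the first two rows have the same values in the columns "ID", "symbol", and "date", and have "T" and "X" in the column "message". I would like to remove these two rows.
However, the last two rows have the same values in the columns "ID", "symbol", and "date", but are blank (different from "X" or "T") in the column "message".
I am interested in applying the function to a large dataset with millions of rows. What I have tried so far consumes all my memory.
Thank you, I appreciate any help.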
Answer 0 (score: 0)
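I think you can use groupby with filter. The condition drops a group only when it has exactly 2 rows with duplicated key values and all of its message values are T or X (checked with isin):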
import pandas as pd
df = pd.DataFrame({"ID":["AA-1", "AA-1", "C-0" ,"BB-2", "BB-2"],
"symbol":["A","A","C","B","B"],
"date":["06/24/2014","06/24/2014","06/20/2013","06/25/2015","06/25/2015"],
"message": ["T","X","T","",""] })
print (df)
ID date message symbol
0 AA-1 06/24/2014 T A
1 AA-1 06/24/2014 X A
2 C-0 06/20/2013 T C
3 BB-2 06/25/2015 B
4 BB-2 06/25/2015 B
df1 = df.groupby(['ID','date','symbol']).filter(lambda x: ~((len(x) == 2) &
(x.message.isin(['T','X']).all())))
print (df1)
ID date message symbol
2 C-0 06/20/2013 T C
3 BB-2 06/25/2015 B
4 BB-2 06/25/2015 B
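Since the question mentions millions of rows, one possible alternative is to build the same condition as a boolean mask with groupby and transform instead of calling a lambda once per group. This is only a sketch and is not part of the original answer; the names keys, is_tx, group_size and all_tx are illustrative, and whether it actually helps with memory or speed on the real data would need to be measured.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["AA-1", "AA-1", "C-0", "BB-2", "BB-2"],
    "symbol": ["A", "A", "C", "B", "B"],
    "date": ["06/24/2014", "06/24/2014", "06/20/2013", "06/25/2015", "06/25/2015"],
    "message": ["T", "X", "T", "", ""],
})

keys = ["ID", "date", "symbol"]  # illustrative name for the grouping columns

# True for rows whose message is T or X
is_tx = df["message"].isin(["T", "X"])

# Size of the group each row belongs to, aligned to df's index
group_size = df.groupby(keys)["message"].transform("size")

# True for every row of a group in which all messages are T or X
all_tx = is_tx.groupby([df[k] for k in keys]).transform("all")

# Drop groups of exactly two rows whose messages are all T or X
df1 = df[~((group_size == 2) & all_tx)]
print(df1)
```

On the sample frame this keeps the rows with index 2, 3 and 4, the same rows the filter version keeps.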
Edit by comment:
import pandas as pd
df = pd.DataFrame({"ID":["AA-1", "AA-1", "C-0", "C-0","BB-2", "BB-2"],
"symbol":["A","A","C","C", "B","B"],
"date":["06/24/2014","06/24/2014","06/20/2013","06/20/2013","06/25/2015","06/25/2015"],
"message": ["T","X","X","X","",""] })
print (df)
ID date message symbol
0 AA-1 06/24/2014 T A
1 AA-1 06/24/2014 X A
2 C-0 06/20/2013 X C
3 C-0 06/20/2013 X C
4 BB-2 06/25/2015 B
5 BB-2 06/25/2015 B
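If you need to remove every group whose message values are all X or T, which also removes double X or double T, and the len of each group is always 2: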
df1 = df.groupby(['ID','date','symbol']).filter(lambda x: ~x.message.isin(['T','X']).all())
print (df1)
ID date message symbol
4 BB-2 06/25/2015 B
5 BB-2 06/25/2015 B
If you only need to remove groups with the values T and X, you can first sort_values by message and then use filter to check whether the first value in each group is T and the second is X ('T' is first and 'X' is second because of the sorting):
df2 = (df.sort_values('message')
         .groupby(['ID','date','symbol'], sort=False)
         .filter(lambda x: ((x.message.iloc[0] != 'T') | (x.message.iloc[1] != 'X'))))
print (df2)
ID date message symbol
4 BB-2 06/25/2015 B
5 BB-2 06/25/2015 B
2 C-0 06/20/2013 X C
3 C-0 06/20/2013 X C
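The positional check with iloc[1] assumes every group has at least two rows. A hedged variant that avoids this assumption compares the sorted message values of each group against ['T', 'X'], so a single-row group is simply kept. The helper name keep_group is hypothetical, and this is a sketch rather than the answer's own method.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["AA-1", "AA-1", "C-0", "C-0", "BB-2", "BB-2"],
    "symbol": ["A", "A", "C", "C", "B", "B"],
    "date": ["06/24/2014", "06/24/2014", "06/20/2013", "06/20/2013",
             "06/25/2015", "06/25/2015"],
    "message": ["T", "X", "X", "X", "", ""],
})

def keep_group(group):
    # Hypothetical helper: keep the group unless its messages are
    # exactly one T and one X (in any order).
    return sorted(group["message"]) != ["T", "X"]

df2 = df.groupby(["ID", "date", "symbol"], sort=False).filter(keep_group)
print(df2)
```

Because no sorting of the frame is needed, the surviving rows stay in their original order (index 2, 3, 4, 5 here).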
Answer 1 (score: 0)
This may work for you:
vals = ['X', 'T']
pd.concat([df[~df.message.isin(vals)], df[df.message.isin(vals)].loc[~df.duplicated(subset=['ID', 'date', 'symbol'], keep=False), :]])
ID date message symbol
3 BB-2 06/25/2014 B
4 BB-2 06/25/2015 B
2 C-0 06/20/2013 T C
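The same selection can probably be written as a single boolean mask, which skips the concat and keeps the rows in their original order. This is a sketch using the data frame from the question; the names dup and result are illustrative.

```python
import pandas as pd

# Data frame as given in the question
df = pd.DataFrame({
    "ID": ["AA-1", "AA-1", "C-0", "BB-2", "BB-2"],
    "symbol": ["A", "A", "C", "B", "B"],
    "date": ["06/24/2014", "06/24/2014", "06/20/2013", "06/25/2014", "06/25/2015"],
    "message": ["T", "X", "T", "", ""],
})

vals = ["X", "T"]

# True for every row whose (ID, date, symbol) combination occurs more than once
dup = df.duplicated(subset=["ID", "date", "symbol"], keep=False)

# Drop a row only if its message is X or T and its key combination is duplicated
result = df[~(df["message"].isin(vals) & dup)]
print(result)
```

The mask is the logical complement of the two pieces that the concat stitches together, so it selects the same rows.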
It is fairly fast:
%%timeit
pd.concat([df[~df.message.isin(['X', 'T'])], df[df.message.isin(['X', 'T'])].loc[~df.duplicated(subset=['ID', 'date', 'symbol'], keep=False), :]])
100 loops, best of 3: 1.99 ms per loop
%%timeit
df.groupby(['ID','date','symbol']).filter(lambda x: ~x.message.isin(['T','X']).all())
100 loops, best of 3: 2.71 ms per loop
The alternative gives an index error.