I have a pandas DataFrame sorted by id and dates1, with the columns id, dates1, dates2, dates3 (three different dates for three different events).
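I want to go through the rows and drop a row if, compared with another record of the same id, the difference in days between the two date1 values is no more than 10, and likewise for date2 and for date3.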
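I thought about using a for loop and a temporary dictionary to store each id and its dates, but that is inefficient in time, and it costs extra storage as well.
This is what I had in mind.
Let's say this is the sample DataFrame: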
import datetime
import pandas as pd

e = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2],
    'date1': [datetime.date(2018, 10, 1), datetime.date(2018, 10, 1), datetime.date(2018, 9, 29),
              datetime.date(2010, 3, 4), datetime.date(2018, 12, 10), datetime.date(2018, 12, 4),
              datetime.date(2018, 11, 29)],
    'date2': [datetime.date(2018, 10, 3), datetime.date(2018, 10, 3), datetime.date(2018, 9, 29),
              datetime.date(2018, 9, 25), datetime.date(2018, 12, 10), datetime.date(2018, 12, 4),
              datetime.date(2015, 1, 1)],
    'date3': [datetime.date(2018, 10, 1), datetime.date(2018, 10, 1), datetime.date(2018, 9, 27),
              datetime.date(2018, 9, 23), datetime.date(2018, 12, 10), datetime.date(2018, 12, 3),
              datetime.date(2015, 1, 1)],
})
Then I drop the unwanted rows, following the description above, with this code:
e_dict = {}        # id -> list of rows already seen for that id
rows_to_drop = []  # index labels of the rows to remove
for index, row in e.iterrows():
    row_id = row['id']
    if row_id in e_dict:
        prev = e_dict[row_id][-1]  # most recent earlier row with the same id
        date1_diff = abs((row['date1'] - prev['date1']).days)
        date2_diff = abs((row['date2'] - prev['date2']).days)
        date3_diff = abs((row['date3'] - prev['date3']).days)
        if date1_diff <= 10 and date2_diff <= 10 and date3_diff <= 10:
            # drop current row from df
            rows_to_drop.append(index)
        e_dict[row_id].append(row)
    else:
        e_dict[row_id] = [row]
e = e.drop(rows_to_drop)
The desired output, i.e. the new DataFrame, would be:
e = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'date1': [datetime.date(2018, 10, 1),
              datetime.date(2010, 3, 4), datetime.date(2018, 12, 10),
              datetime.date(2018, 11, 29)],
    'date2': [datetime.date(2018, 10, 3),
              datetime.date(2018, 9, 25), datetime.date(2018, 12, 10),
              datetime.date(2015, 1, 1)],
    'date3': [datetime.date(2018, 10, 1),
              datetime.date(2018, 9, 23), datetime.date(2018, 12, 10),
              datetime.date(2015, 1, 1)],
})
Answer 0 (score: 1)
By shifting each Series by one row, you can compare every row with the previous one and use the result as a filter.
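def diff_zero(ds):
    # True where the value equals the previous row's value (False on the first row)
    return ds == ds.shift()

def days_diff_less_than(ds, val):
    # pd.to_datetime lets this work on datetime.date objects as well as datetime64 columns
    ds = pd.to_datetime(ds)
    diff = (ds.shift() - ds).dt.days.abs()
    # The first row has no previous row (NaN gap) and is treated as a match
    return pd.isna(diff) | (diff <= val)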
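# Column names follow the question's prose (dates1, dates2, dates3); the sample frame calls them date1, date2, date3.
# Each pass drops rows whose date is within 10 days of the previous row with the same id.
e = e.drop(e[days_diff_less_than(e['dates1'], 10) & diff_zero(e['id'])].index)
e = e.drop(e[days_diff_less_than(e['dates2'], 10) & diff_zero(e['id'])].index)
e = e.drop(e[days_diff_less_than(e['dates3'], 10) & diff_zero(e['id'])].index)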
print(e)
# id dates1 dates2 dates3
# 0 1 2018-10-01 2018-10-01 2018-10-01
# 3 1 2010-03-04 2010-03-04 2010-03-04
# 4 2 2018-12-10 2018-12-10 2018-12-10
# 6 2 2015-01-01 2015-01-01 2015-01-01
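To see what the shift-based masks actually contain, here is a minimal sketch on a small made-up frame. The frame `toy` and its column `when` are hypothetical and are not the data from the question; the point is only to show how `shift()` lines each row up against the one before it.

```python
import pandas as pd

# Hypothetical toy data, not the frame from the question.
toy = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'when': pd.to_datetime(['2018-10-01', '2018-09-29', '2018-12-10', '2018-11-01']),
})

prev_id = toy['id'].shift()      # id of the previous row (NaN on the first row)
prev_when = toy['when'].shift()  # date of the previous row (NaT on the first row)

# Comparing against NaN is False, so the first row never counts as "same id".
same_id = toy['id'] == prev_id
# Absolute gap in days to the previous row; NaN where there is no previous row.
gap_days = (prev_when - toy['when']).dt.days.abs()
# NaN <= 10 evaluates to False, so missing gaps never count as "close".
close = gap_days <= 10

mask = same_id & close
print(mask.tolist())  # [False, True, False, False]
```

Only the second row is flagged: it shares its id with the row above and lies 2 days away from it. The third row starts a new id, and the fourth is 39 days away from its predecessor.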
If instead all of the day differences must be no more than 10, the code changes to:
e = e.drop(
    e[days_diff_less_than(e['date1'], 10)
      & days_diff_less_than(e['date2'], 10)
      & days_diff_less_than(e['date3'], 10)
      & diff_zero(e['id'])].index
)
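As an alternative sketch (not the answerer's code), the same rule that all three dates must be close can be written with `groupby('id').shift()`, so that a row is only ever compared with the previous row of its own id. The function name `drop_close_rows` and its `max_days` parameter are made up for illustration. The sketch assumes "previous" means the previous row in the original frame, as in the question's loop, and converts the `datetime.date` objects with `pd.to_datetime` first.

```python
import datetime
import pandas as pd

def drop_close_rows(df, date_cols, max_days=10):
    """Drop rows whose dates all lie within max_days of the previous row of the same id."""
    dates = df[date_cols].apply(pd.to_datetime)  # datetime.date objects -> datetime64
    prev = dates.groupby(df['id']).shift()       # previous row per id; NaT on each id's first row
    gaps = (dates - prev).apply(lambda col: col.dt.days).abs()
    # NaN gaps compare as False, so the first row of every id is always kept.
    close = (gaps <= max_days).all(axis=1)
    return df[~close]

# The sample frame from the question, with a short alias for datetime.date.
d = datetime.date
e = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2],
    'date1': [d(2018, 10, 1), d(2018, 10, 1), d(2018, 9, 29), d(2010, 3, 4),
              d(2018, 12, 10), d(2018, 12, 4), d(2018, 11, 29)],
    'date2': [d(2018, 10, 3), d(2018, 10, 3), d(2018, 9, 29), d(2018, 9, 25),
              d(2018, 12, 10), d(2018, 12, 4), d(2015, 1, 1)],
    'date3': [d(2018, 10, 1), d(2018, 10, 1), d(2018, 9, 27), d(2018, 9, 23),
              d(2018, 12, 10), d(2018, 12, 3), d(2015, 1, 1)],
})

result = drop_close_rows(e, ['date1', 'date2', 'date3'])
print(result.index.tolist())  # [0, 3, 4, 6]
```

On the question's sample data this keeps rows 0, 3, 4 and 6, which matches the desired output. Grouping before shifting removes the need for a separate same-id mask.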