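My data looks like the following. I have a groupby that groups on Visit_id, but now I want to drop all rows unless the Visit_id has two Visit_time values within one week of each other.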
df allVisits:
Visit_id Visit_time
162 2009-01-21 00:00:00.000
162 2012-09-05 00:00:00.000
213 2010-06-21 00:00:00.000
213 2010-06-22 00:00:00.000
216 2011-07-06 00:00:00.000
216 2012-04-11 00:00:00.000
216 2012-04-12 00:00:00.000
I would like it to look like:
Visit_id Visit_time
213 2010-06-21 00:00:00.000
213 2010-06-22 00:00:00.000
216 2012-04-11 00:00:00.000
216 2012-04-12 00:00:00.000
Currently my code is:
allVisits.groupby(['Visit_id']).apply()
What can I do from here?
Thanks in advance!
Answer 0 (score: 1)
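Interpretation 1
If you want to keep all records for every Visit_id that has at least two records within one week of each other, here is one way to do it.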
df.sort_values(['Visit_id', 'Visit_time'], inplace=True) # sort the rows by date
# shift the records within each group to find the time difference
# between the dates of the records
df['time_shift'] = df.groupby('Visit_id')['Visit_time'].transform(lambda x: x.shift())
df['time_diff'] = (df['time_shift'] - df['Visit_time']).dt.days
# filter the dataframe on the Visit_ids that have dates within 7 days of each other
df.groupby('Visit_id').filter(lambda x: (abs(x['time_diff']) <= 7).any())
# Visit_id Visit_time time_shift time_diff
# 2 213 2010-06-21 NaT NaN
# 3 213 2010-06-22 2010-06-21 -1.0
# 4 216 2011-07-06 NaT NaN
# 5 216 2012-04-11 2011-07-06 -280.0
# 6 216 2012-04-12 2012-04-11 -1.0
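The same idea can be sketched as a self-contained script that leaves no helper columns on the frame. This is only a sketch under a few assumptions: the names `all_visits`, `has_close_pair` and `max_gap` are made up for illustration, and `Visit_time` is assumed to be parsed as datetimes (here via `pd.to_datetime`).

```python
import pandas as pd

# Sample data from the question; Visit_time is parsed into datetimes.
all_visits = pd.DataFrame({
    "Visit_id": [162, 162, 213, 213, 216, 216, 216],
    "Visit_time": pd.to_datetime([
        "2009-01-21", "2012-09-05", "2010-06-21", "2010-06-22",
        "2011-07-06", "2012-04-11", "2012-04-12",
    ]),
})


def has_close_pair(group, max_gap=pd.Timedelta(days=7)):
    """Return True if two consecutive visits in the group are within max_gap."""
    # After sorting, the closest pair of visits is always a consecutive pair,
    # so checking consecutive gaps is enough. The first gap is NaT, which
    # compares as False.
    gaps = group["Visit_time"].sort_values().diff()
    return bool((gaps <= max_gap).any())


# Keep every row of each Visit_id that has at least one close pair.
kept_groups = all_visits.groupby("Visit_id").filter(has_close_pair)
print(kept_groups)
```

Comparing against a `pd.Timedelta` rather than `.dt.days` avoids truncating partial days, which could matter if the timestamps are not all at midnight.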
Interpretation 2
If you mean keeping only the records that are within 7 days of each other, try the following solution.
df.sort_values(['Visit_id', 'Visit_time'], inplace=True) # sort the rows by date
# shift the records within each group to find the time difference
# between the dates of the records
df['time_shift'] = df.groupby('Visit_id')['Visit_time'].transform(lambda x: x.shift())
df['time_diff'] = (df['time_shift'] - df['Visit_time']).dt.days
df['keep_idx'] = df['time_diff'].abs() <= 7 # True on the later record of each close pair
# we need to undo the shift we performed before and make sure that
# we capture both records involved. Hence the OR operation.
df['keep_idx'] = df['keep_idx'] | \
    df.groupby('Visit_id')['keep_idx'].transform(lambda x: x.shift(-1, fill_value=False))
df.loc[df['keep_idx']] # subset on the rows we want
# Visit_id Visit_time time_shift time_diff
# 2 213 2010-06-21 NaT NaN
# 3 213 2010-06-22 2010-06-21 -1.0
# 5 216 2012-04-11 2011-07-06 -280.0
# 6 216 2012-04-12 2012-04-11 -1.0
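As a variation on the shift approach, each record can be compared against both its previous and its next visit, which avoids shifting the flag back afterwards. This is a rough sketch rather than the only way to do it; the names `all_visits`, `ordered`, `week` and `kept_rows` are illustrative, and `Visit_time` is again assumed to hold datetimes.

```python
import pandas as pd

# Sample data from the question; Visit_time is parsed into datetimes.
all_visits = pd.DataFrame({
    "Visit_id": [162, 162, 213, 213, 216, 216, 216],
    "Visit_time": pd.to_datetime([
        "2009-01-21", "2012-09-05", "2010-06-21", "2010-06-22",
        "2011-07-06", "2012-04-11", "2012-04-12",
    ]),
})

week = pd.Timedelta(days=7)

# Sort so that neighbouring rows within a Visit_id are consecutive visits.
ordered = all_visits.sort_values(["Visit_id", "Visit_time"])
times = ordered.groupby("Visit_id")["Visit_time"]

# Gap to the previous and to the next visit of the same Visit_id.
# The first/last visit of each id has no neighbour, so its gap is NaT,
# and NaT comparisons evaluate to False.
gap_to_previous = ordered["Visit_time"] - times.shift(1)
gap_to_next = times.shift(-1) - ordered["Visit_time"]

# Keep a record if either neighbour lies within one week of it.
close_to_neighbour = (gap_to_previous <= week) | (gap_to_next <= week)
kept_rows = ordered[close_to_neighbour]
print(kept_rows)
```

On the sample data this keeps the two 213 records and the two April 2012 records for 216, which matches the desired output shown in the question.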
If you prefer, you can turn either of these into a function and use the apply
method, but for clarity the solutions above are laid out line by line.