How to keep rows of a Pandas DataFrame where two entries are within one week of each other?

Asked: 2018-02-20 02:36:21

Tags: pandas jupyter data-science

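My data looks like the following. I have a groupby that groups by Visit_id, but now I want to drop all rows unless the Visit_id has two Visit_time values within one week of each other.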

df allVisits:

Visit_id    Visit_time
162         2009-01-21 00:00:00.000
162         2012-09-05 00:00:00.000
213         2010-06-21 00:00:00.000
213         2010-06-22 00:00:00.000 
216         2011-07-06 00:00:00.000
216         2012-04-11 00:00:00.000
216         2012-04-12 00:00:00.000

I would like it to look like:

Visit_id    Visit_time
213         2010-06-21 00:00:00.000
213         2010-06-22 00:00:00.000 
216         2012-04-11 00:00:00.000
216         2012-04-12 00:00:00.000

Currently my code is:

allVisits.groupby(['Visit_id']).apply()

What can I do from here?

Thanks in advance!

1 answer:

Answer 0 (score: 1)

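Interpretation 1

If you want to keep all records of every Visit_id that has at least two records within a week of each other, here is one way to do it. It assumes Visit_time is already a datetime column.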

df.sort_values(['Visit_id', 'Visit_time'], inplace=True)  # sort the rows by date

# shift the records within each group to find the time difference
# between the dates of the records
df['time_shift'] = df.groupby('Visit_id')['Visit_time'].transform(lambda x: x.shift())
df['time_diff'] = (df['time_shift'] - df['Visit_time']).dt.days

# filter the dataframe on the Visit_ids that have dates within 7 days of each other
df.groupby('Visit_id').filter(lambda x: (abs(x['time_diff']) <= 7).any())

#    Visit_id Visit_time  time_shift  time_diff
# 2       213 2010-06-21         NaT        NaN
# 3       213 2010-06-22  2010-06-21       -1.0
# 4       216 2011-07-06         NaT        NaN
# 5       216 2012-04-11  2011-07-06     -280.0
# 6       216 2012-04-12  2012-04-11       -1.0
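
For anyone who wants to run Interpretation 1 end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question and uses groupby diff instead of the shift and subtract steps above, so it is a variant of the same idea and not the exact code of the answer. The names gap, close, has_close_pair and kept_groups are only illustrative. Note that it compares the full timedelta against 7 days, while the code above compares whole days.

```python
import pandas as pd

# Sample data copied from the question
allVisits = pd.DataFrame({
    "Visit_id": [162, 162, 213, 213, 216, 216, 216],
    "Visit_time": pd.to_datetime([
        "2009-01-21", "2012-09-05", "2010-06-21", "2010-06-22",
        "2011-07-06", "2012-04-11", "2012-04-12",
    ]),
})

df = allVisits.sort_values(["Visit_id", "Visit_time"])

# Gap to the previous visit of the same Visit_id (NaT for the first visit)
gap = df.groupby("Visit_id")["Visit_time"].diff()

# True for rows that are at most 7 days after the previous visit;
# comparisons against NaT evaluate to False
close = gap <= pd.Timedelta(days=7)

# Broadcast "this Visit_id has at least one close pair" to every row of the group
has_close_pair = close.groupby(df["Visit_id"]).transform("any")

kept_groups = df[has_close_pair]
print(kept_groups)
```

With the sample data this should keep every row of Visit_id 213 and 216, including the 2011-07-06 visit, and drop Visit_id 162.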

Interpretation 2

If you mean keeping only the records that are within 7 days of each other, try the following solution.

df.sort_values(['Visit_id', 'Visit_time'], inplace=True)  # sort the rows by date

# shift the records within each group to find the time difference
# between the dates of the records
df['time_shift'] = df.groupby('Visit_id')['Visit_time'].transform(lambda x: x.shift())
df['time_diff'] = (df['time_shift'] - df['Visit_time']).dt.days

df['keep_idx'] = df.groupby('Visit_id')['time_diff'].transform(lambda x: abs(x) <= 7)
# we need to undo the shift we performed before and make sure that
# we capture both records involved. Hence the OR operation.
df['keep_idx'] = df['keep_idx'] | \
     df.groupby('Visit_id')['keep_idx'].transform(lambda x: x.shift(-1, fill_value=False))
df.loc[df['keep_idx'] > 0]  # subset on the indices we want

#    Visit_id Visit_time  time_shift  time_diff
# 2       213 2010-06-21         NaT        NaN
# 3       213 2010-06-22  2010-06-21       -1.0
# 5       216 2012-04-11  2011-07-06     -280.0
# 6       216 2012-04-12  2012-04-11       -1.0
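
Interpretation 2 can also be written without helper columns. The sketch below is one possible condensed form, assuming the same sample data as the question. It looks at the gap to the previous and to the next visit within each Visit_id and keeps a row if either gap is at most 7 days. The names week, gap_to_prev, gap_to_next and close_pairs are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
allVisits = pd.DataFrame({
    "Visit_id": [162, 162, 213, 213, 216, 216, 216],
    "Visit_time": pd.to_datetime([
        "2009-01-21", "2012-09-05", "2010-06-21", "2010-06-22",
        "2011-07-06", "2012-04-11", "2012-04-12",
    ]),
})

df = allVisits.sort_values(["Visit_id", "Visit_time"])
week = pd.Timedelta(days=7)
visits = df.groupby("Visit_id")["Visit_time"]

# Gap to the previous visit and (as an absolute value) to the next visit
# within the same Visit_id; the first/last visit of a group gets NaT
gap_to_prev = visits.diff()
gap_to_next = visits.diff(-1).abs()

# Keep a row if it is close to either neighbour; NaT comparisons are False
close_pairs = df[(gap_to_prev <= week) | (gap_to_next <= week)]
print(close_pairs)
```

Here the 2011-07-06 visit of Visit_id 216 is dropped, because it is not within a week of any other visit.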

If you like, you can turn either of these into a function and use the apply method, but for clarity the solutions above are written out line by line.
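
As a rough illustration of the function-plus-apply variant, here is a sketch for Interpretation 2. The function name within_week_mask and its days parameter are made up for this example and are not part of the original answer. It is applied to the Visit_time column of each group and returns a boolean mask, which is then used to subset the original frame.

```python
import pandas as pd

# Sample data copied from the question
allVisits = pd.DataFrame({
    "Visit_id": [162, 162, 213, 213, 216, 216, 216],
    "Visit_time": pd.to_datetime([
        "2009-01-21", "2012-09-05", "2010-06-21", "2010-06-22",
        "2011-07-06", "2012-04-11", "2012-04-12",
    ]),
})


def within_week_mask(times, days=7):
    """Hypothetical helper: flag visits that have a neighbour within `days`."""
    ordered = times.sort_values()
    limit = pd.Timedelta(days=days)
    near_prev = ordered.diff() <= limit        # close to the previous visit
    near_next = ordered.diff(-1).abs() <= limit  # close to the next visit
    # Return the flags in the group's original row order
    return (near_prev | near_next).reindex(times.index)


# group_keys=False keeps the original row index on the result
mask = (
    allVisits.groupby("Visit_id", group_keys=False)["Visit_time"]
    .apply(within_week_mask)
)
result = allVisits[mask]
print(result)
```

For Interpretation 1, the same mask could be reduced per group, for example with mask.groupby(allVisits["Visit_id"]).transform("any"), before subsetting.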