我有一个pandas数据帧,其中每一行代表一段时间。有一个start_date
列,其中包含期间开始的日期时间,还有一个end_date
列,其中包含期间结束的日期时间。
通常,我将通过过滤其开始日期和结束日期相对于其他日期的行来查询此数据集。例如,如果我想过滤期间包含特定日期的所有行,我会这样做:
df = df[(df['start_date' < my_date]) & ('end_date' > my_date)]
有很多像这样的查询很好,没有问题。我可以做类似于过滤两个日期之间或两个日期之间结束的期间,或者由两个日期包围。这一切都很好,很好。
我现在有一个与我想要运行的日期相关的不同标准:我想查找其周期包含一年中某一天的所有行(例如4月6日)。
那么,如何查找此数据框中的所有行,其开始日期和结束日期都包含日期值为6且月份值为4的日期?
我能想到的唯一方法是构建4月6日(1970-04-06,1971-04-06 ...)的日期范围,然后对所有日期进行上述日期比较,并采用所有这些比较的按位OR来获得包含4月6日日期之一的句点索引。
这会起作用,但感觉就像一块垃圾。是否有更优雅的方式来完成这项任务?
编辑:根据要求,以下是开始日期,结束日期以及它们是否包含4月6日的示例:
start_date end_date contains_april_6th
0 2012-07-10 23:06:07 2014-12-30 15:07:09 True
1 2011-10-21 13:56:20 2011-10-25 12:54:26 False
2 2011-04-23 03:25:00 2013-12-10 23:08:02 True
3 2014-08-10 01:14:03 2016-01-23 22:23:52 True
4 2017-04-23 17:15:43 2017-06-03 14:02:19 False
5 2012-10-01 01:42:00 2016-02-05 04:17:16 True
6 2015-08-21 19:59:45 2015-09-06 03:58:59 False
7 2016-06-10 23:03:14 2016-10-26 23:00:53 False
8 2011-08-12 20:50:34 2013-09-22 23:03:05 True
9 2012-11-04 00:45:24 2013-08-15 02:39:36 True
10 2016-03-28 07:51:55 2017-10-07 13:07:57 True
11 2010-08-03 06:42:04 2012-01-04 09:04:00 True
12 2010-05-11 01:22:27 2012-12-30 05:51:48 True
13 2011-12-30 01:19:28 2014-06-20 09:13:32 True
14 2011-11-12 18:04:15 2013-01-05 19:47:20 True
15 2014-04-13 10:40:25 2015-06-08 11:00:44 True
16 2012-10-23 05:43:02 2014-11-04 11:53:35 True
17 2011-03-18 05:00:50 2012-07-30 17:24:36 True
18 2010-04-24 10:00:09 2011-02-21 23:26:50 False
19 2012-02-09 08:05:07 2016-04-06 14:24:22 True
20 2012-04-05 08:39:50 2014-03-11 01:22:43 True
21 2011-07-30 19:19:32 2017-02-14 17:04:36 True
22 2010-01-02 23:48:45 2012-12-16 14:42:24 True
23 2011-11-08 23:49:27 2012-03-11 22:17:14 False
24 2013-02-25 02:27:50 2014-08-02 00:09:07 True
25 2010-07-11 18:56:22 2014-09-06 20:38:50 True
26 2012-11-23 02:08:11 2013-02-26 18:51:47 False
27 2011-05-31 09:02:02 2017-08-01 01:58:09 True
28 2012-10-06 10:09:32 2017-01-03 00:29:34 True
29 2015-10-18 22:37:40 2017-10-06 17:05:42 True
30 2014-06-08 12:10:40 2017-09-30 02:23:40 True
31 2014-06-15 09:33:27 2015-04-17 06:10:28 True
32 2015-06-04 07:32:59 2017-09-22 21:35:34 True
33 2012-09-05 22:24:04 2013-04-18 04:04:50 True
34 2012-07-09 10:44:32 2017-09-27 10:30:33 True
35 2013-01-16 11:50:53 2015-07-01 15:06:51 True
36 2012-05-12 23:14:08 2016-03-31 02:08:50 True
37 2013-02-12 21:21:58 2015-03-13 01:36:45 True
38 2012-08-12 05:07:33 2016-04-21 06:45:48 True
39 2012-09-08 20:40:18 2013-08-12 22:05:30 True
40 2010-10-20 02:05:23 2011-08-18 06:37:04 True
41 2010-09-27 07:44:21 2012-11-25 04:20:47 True
42 2014-09-04 22:27:01 2015-05-18 21:44:14 True
43 2012-01-06 15:35:33 2016-03-02 00:46:20 True
44 2013-09-06 11:39:50 2017-08-21 07:39:03 True
45 2011-01-01 00:58:45 2014-03-22 03:40:25 True
46 2015-07-03 04:02:56 2016-02-16 20:34:17 False
47 2010-06-20 15:24:54 2015-11-17 08:05:10 True
48 2011-06-24 02:39:13 2015-01-13 18:28:24 True
49 2011-09-30 06:53:27 2011-11-13 12:11:41 False
50 2015-10-11 03:13:35 2016-06-29 01:00:01 True
答案 0 :(得分:2)
我认为您正在寻找在某段时间之间检查没有提及任何年份的日期。您可以创建区间索引,然后检查年份是否在...(Reference)
之间# A custom function to return true of certain date is in between range
def get_date(x):
le = x.name.left.year
ri = x.name.right.year
# Create a date based on the interval years
rng = pd.to_datetime(['04-06-{}'.format(str(i)) for i in range(le-1,ri+1)])
# Return true if any of the above date is present in the interval
m = any([i in x.name for i in rng])
return m
df.index = pd.IntervalIndex.from_arrays(df['start_date'],df['end_date'])
df['new'] = df.apply(get_date,1)
输出:
print(df.reset_index(drop=True))
start_date end_date contains_april_6th new 0 2012-07-10 23:06:07 2014-12-30 15:07:09 True True 1 2011-10-21 13:56:20 2011-10-25 12:54:26 False False 2 2011-04-23 03:25:00 2013-12-10 23:08:02 True True 3 2014-08-10 01:14:03 2016-01-23 22:23:52 True True 4 2017-04-23 17:15:43 2017-06-03 14:02:19 False False 5 2012-10-01 01:42:00 2016-02-05 04:17:16 True True 6 2015-08-21 19:59:45 2015-09-06 03:58:59 False False 7 2016-06-10 23:03:14 2016-10-26 23:00:53 False False 8 2011-08-12 20:50:34 2013-09-22 23:03:05 True True 9 2012-11-04 00:45:24 2013-08-15 02:39:36 True True
我会选择@ jezreal的解决方案。在这里,我使用了apply,这在数据帧的长度增加时非常耗时。
答案 1 :(得分:2)
我认为list comprehension
创建的所有面具都需要np.logical_or and reduce
:
#get minimal and max years
stacked = df[['start_date','end_date']].stack()
min_date = stacked.min().year
max_date = stacked.max().year
print (min_date)
print (max_date)
2010
2017
#for each date create mask
masks = [(df['start_date'] < pd.Timestamp(x,4,6)) &
(df['end_date'] > pd.Timestamp(x,4,6)) for x in range(min_date, max_date + 1)]
#reduce all masks with OR
df['new'] = np.logical_or.reduce(masks)
print (df)
start_date end_date contains_april_6th new
0 2012-07-10 23:06:07 2014-12-30 15:07:09 True True
1 2011-10-21 13:56:20 2011-10-25 12:54:26 False False
2 2011-04-23 03:25:00 2013-12-10 23:08:02 True True
3 2014-08-10 01:14:03 2016-01-23 22:23:52 True True
4 2017-04-23 17:15:43 2017-06-03 14:02:19 False False
5 2012-10-01 01:42:00 2016-02-05 04:17:16 True True
6 2015-08-21 19:59:45 2015-09-06 03:58:59 False False
7 2016-06-10 23:03:14 2016-10-26 23:00:53 False False
8 2011-08-12 20:50:34 2013-09-22 23:03:05 True True
9 2012-11-04 00:45:24 2013-08-15 02:39:36 True True
10 2016-03-28 07:51:55 2017-10-07 13:07:57 True True
11 2010-08-03 06:42:04 2012-01-04 09:04:00 True True
12 2010-05-11 01:22:27 2012-12-30 05:51:48 True True
13 2011-12-30 01:19:28 2014-06-20 09:13:32 True True
14 2011-11-12 18:04:15 2013-01-05 19:47:20 True True
15 2014-04-13 10:40:25 2015-06-08 11:00:44 True True
16 2012-10-23 05:43:02 2014-11-04 11:53:35 True True
17 2011-03-18 05:00:50 2012-07-30 17:24:36 True True
18 2010-04-24 10:00:09 2011-02-21 23:26:50 False False
19 2012-02-09 08:05:07 2016-04-06 14:24:22 True True
20 2012-04-05 08:39:50 2014-03-11 01:22:43 True True
21 2011-07-30 19:19:32 2017-02-14 17:04:36 True True
22 2010-01-02 23:48:45 2012-12-16 14:42:24 True True
23 2011-11-08 23:49:27 2012-03-11 22:17:14 False False
24 2013-02-25 02:27:50 2014-08-02 00:09:07 True True
25 2010-07-11 18:56:22 2014-09-06 20:38:50 True True
26 2012-11-23 02:08:11 2013-02-26 18:51:47 False False
27 2011-05-31 09:02:02 2017-08-01 01:58:09 True True