Question

My dataframe has the following format:

import pandas as pd
df = pd.DataFrame({'Date': pd.to_datetime(['2017-01-01', '2017-02-13', '2018-03-01', '2018-04-01']), 'Value': [1, 2, 3, 4]})

For each year I have a different date range (for example, 2017-02-02 to 2017-02-15 for 2017, and 2018-03-03 to 2018-04-04 for 2018), stored as a dictionary.

dates_dict = {2017: ('2017-02-02', '2017-02-15'), 2018: ('2018-03-03', '2018-04-04')}

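I want to create a new column in the dataframe that is True if Date falls within the date range for that year, and False otherwise. For the given example, the output would be: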

df =    Date        Value  in_range
     0  2017-01-01  1      False
     1  2017-02-13  2      True
     2  2018-03-01  3      False
     3  2018-04-01  4      True

My current solution is:

temp = []
for name, group in df.groupby(df['Date'].dt.year):
    temp.append((group['Date'] >= dates_dict[name][0]) &
                (group['Date'] <= dates_dict[name][1]))
in_range = pd.concat(temp)
in_range = in_range.rename('in_range')
df = df.merge(in_range.to_frame(), left_index=True, right_index=True)

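This works, but I'm sure there is a more concise way to do it. More generally, is there a better way to check whether a date falls within a larger list of date ranges?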

Answer 1

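Setup

You can make your solution more efficient by converting the dictionary so that it actually contains pd.date_range objects. Both solutions below assume you have made this conversion: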

dates_dict = {k: pd.date_range(s, e) for k, (s, e) in dates_dict.items()}
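As a quick check of what this conversion produces, here is a minimal sketch that rebuilds the dictionary from the question and tests membership for single timestamps. The name `ranges` is only illustrative, chosen so the original `dates_dict` is left untouched.

```python
import pandas as pd

# Dictionary of (start, end) strings, as given in the question.
dates_dict = {2017: ('2017-02-02', '2017-02-15'), 2018: ('2018-03-03', '2018-04-04')}

# Each value becomes a DatetimeIndex with one entry per day in the range.
ranges = {k: pd.date_range(s, e) for k, (s, e) in dates_dict.items()}

# A Timestamp can be tested against a DatetimeIndex with the `in` operator.
inside = pd.Timestamp('2017-02-13') in ranges[2017]
outside = pd.Timestamp('2017-01-01') in ranges[2017]
print(inside, outside)
```

Note that pd.date_range defaults to daily frequency, so only timestamps at midnight match; a Date column carrying time-of-day values would need to be normalized first.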

Option 1
Using apply with a dictionary lookup:

df.Date.apply(lambda x: x in dates_dict[x.year])

0    False
1     True
2    False
3     True
Name: Date, dtype: bool

Option 2
A list comprehension, which is often slightly faster than apply:

df['in_range'] = [i in dates_dict[i.year] for i in df.Date]

        Date  Value  in_range
0 2017-01-01      1     False
1 2017-02-13      2      True
2 2018-03-01      3     False
3 2018-04-01      4      True

Timings

In [208]: %timeit df.Date.apply(lambda x: x in dates_dict[x.year], 1)
289 ms ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [209]: %timeit [i in dates_dict[i.year] for i in df.Date]
284 ms ± 6.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
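To try both options end to end, here is a self-contained sketch. It assumes the sample data from the question with Date parsed as datetime, and uses the illustrative names `ranges`, `opt1` and `opt2`.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-01-01', '2017-02-13', '2018-03-01', '2018-04-01']),
    'Value': [1, 2, 3, 4],
})
dates_dict = {2017: ('2017-02-02', '2017-02-15'), 2018: ('2018-03-03', '2018-04-04')}

# Setup step: one daily DatetimeIndex per year.
ranges = {k: pd.date_range(s, e) for k, (s, e) in dates_dict.items()}

# Option 1: apply with a dictionary lookup (returns a boolean Series).
opt1 = df['Date'].apply(lambda x: x in ranges[x.year])

# Option 2: list comprehension (returns a plain list of booleans).
opt2 = [d in ranges[d.year] for d in df['Date']]

df['in_range'] = opt2
print(df)
```

Both options do one Python-level lookup per row, which would explain why the timings above are so close.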

Answer 2

You can use map to create a Series, ser, that holds the dictionary value for the year of each Date, and then use between, for example:

ser = df.Date.dt.year.map(dates_dict)
df['in_range'] = df.Date.between(pd.to_datetime(ser.str[0]), pd.to_datetime(ser.str[1]))

This works on the original dictionary of (start, end) strings, not on the pd.date_range version from Answer 1.
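The same idea written out as a runnable sketch, assuming the question's sample data and the original dictionary of string tuples. The intermediate names `bounds`, `start` and `end` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-01-01', '2017-02-13', '2018-03-01', '2018-04-01']),
    'Value': [1, 2, 3, 4],
})
# Original (start, end) string tuples, NOT converted to pd.date_range.
dates_dict = {2017: ('2017-02-02', '2017-02-15'), 2018: ('2018-03-03', '2018-04-04')}

# Map each row's year to its (start, end) tuple.
bounds = df['Date'].dt.year.map(dates_dict)

# .str[i] indexes into each tuple; convert the results to datetimes.
start = pd.to_datetime(bounds.str[0])
end = pd.to_datetime(bounds.str[1])

# between compares row by row and includes both endpoints by default.
df['in_range'] = df['Date'].between(start, end)
print(df)
```

Because the comparison is vectorized and does not expand each range into individual days, this approach should also handle dates that carry a time component.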
