Question

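I have a dataset in which every observation has a Date, and a separate list of events. I want to filter the dataset and keep only the observations whose date falls within +/- 30 days of an event. I also want to know which event each observation is closest to.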

For example, the main dataset looks like this:

Product Date
Chicken 2008-09-08
Pork    2008-08-22
Beef    2008-08-15
Rice    2008-07-22
Coke    2008-04-05
Cereal  2008-04-03
Apple   2008-04-02
Banana  2008-04-01

It is generated by:

import pandas as pd

d = {'Product': ['Apple', 'Banana', 'Cereal', 'Coke', 'Rice', 'Beef', 'Pork', 'Chicken'],
     'Date': ['2008-04-02', '2008-04-01', '2008-04-03', '2008-04-05',
              '2008-07-22', '2008-08-15', '2008-08-22', '2008-09-08']}

df = pd.DataFrame(data = d)

df['Date'] = pd.to_datetime(df['Date'])

Then I have a column of events:

Date
2008-05-03
2008-07-20
2008-09-01

It is generated by:

event = pd.DataFrame({'Date': pd.to_datetime(['2008-05-03', '2008-07-20', '2008-09-01'])})

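Goal (edited)

I want to keep a row of df only if its df['Date'] is within one month of some event['Date']. For example, the first event occurred on 2008-05-03, so I want to keep the observations dated between 2008-04-03 and 2008-06-03, and create a new column saying that the event those observations are closest to is 2008-05-03.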

Product Date        Event
Chicken 2008-09-08  2008-09-01
Pork    2008-08-22  2008-09-01
Beef    2008-08-15  2008-07-20
Rice    2008-07-22  2008-07-20
Coke    2008-04-05  2008-05-03
Cereal  2008-04-03  2008-05-03

Answer 1

Use numpy broadcasting to test whether each date is within 30 days of any event:

import numpy as np

df[np.any(np.abs(df.Date.values[:,None]-event.Date.values)/np.timedelta64(1,'D')<31,1)]
Out[90]: 
   Product       Date
0  Chicken 2008-09-08
1     Pork 2008-08-22
2     Beef 2008-08-15
3     Rice 2008-07-22
4     Coke 2008-04-05
5   Cereal 2008-04-03
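
The one-liner above only filters; it does not report which event each kept row is closest to. Below is a minimal sketch of one way to extend the same broadcasting idea, assuming the df and event frames from the question. The names gaps, nearest, within and result are illustrative and do not come from the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cereal', 'Coke', 'Rice', 'Beef', 'Pork', 'Chicken'],
    'Date': pd.to_datetime(['2008-04-02', '2008-04-01', '2008-04-03', '2008-04-05',
                            '2008-07-22', '2008-08-15', '2008-08-22', '2008-09-08']),
})
event = pd.DataFrame({'Date': pd.to_datetime(['2008-05-03', '2008-07-20', '2008-09-01'])})

# Pairwise absolute gaps between every observation and every event,
# shape (len(df), len(event)), dtype timedelta64[ns]
gaps = np.abs(df['Date'].values[:, None] - event['Date'].values)

# Position of the closest event for each observation (first one wins on ties)
nearest = gaps.argmin(axis=1)

# True where the closest event is at most 30 days away
within = gaps.min(axis=1) <= np.timedelta64(30, 'D')

# Keep the matching rows and attach the date of their closest event
result = df.loc[within].assign(Event=event['Date'].values[nearest[within]])
print(result)
```

On the sample data this keeps Cereal, Coke, Rice, Beef, Pork and Chicken. Beef (2008-08-15) is 17 days from 2008-09-01 and 26 days from 2008-07-20, so a strict nearest-event rule assigns it 2008-09-01, which differs from the expected table in the question.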

Answer 2

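# Keep a copy of the event date, because the merged frame keeps only the left frame's 'Date'
event['eDate'] = event.Date
# Attach the nearest event to every row; merge_asof needs both frames sorted by the key
df = pd.merge_asof(df.sort_values('Date'), event.sort_values('Date'), on="Date", direction='nearest')
# Keep only the rows that are at most 30 days away from their nearest event
df[(df.Date - df.eDate).abs() <= pd.Timedelta('30 days')]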
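
For reference, here is a self-contained sketch of the same merge_asof idea that leaves df and event unmodified and names the new column Event, as in the question's expected output. The frames are rebuilt from the question; events, merged and result are illustrative names, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cereal', 'Coke', 'Rice', 'Beef', 'Pork', 'Chicken'],
    'Date': pd.to_datetime(['2008-04-02', '2008-04-01', '2008-04-03', '2008-04-05',
                            '2008-07-22', '2008-08-15', '2008-08-22', '2008-09-08']),
})
event = pd.DataFrame({'Date': pd.to_datetime(['2008-05-03', '2008-07-20', '2008-09-01'])})

# Copy the event date into its own column so it survives the merge on 'Date'
events = event.assign(Event=event['Date']).sort_values('Date')

# For each observation, pick the event whose date is nearest (earlier or later)
merged = pd.merge_asof(df.sort_values('Date'), events, on='Date', direction='nearest')

# Drop observations whose nearest event is more than 30 days away
result = merged[(merged['Date'] - merged['Event']).abs() <= pd.Timedelta(days=30)]
print(result)
```

The result is ordered by Date ascending because merge_asof requires sorted input; sort it again afterwards if another order is needed.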

Answer 3

I would use a list comprehension together with an IntervalIndex:

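# pd.offsets.MonthOffset is not available in recent pandas releases;
# pd.DateOffset(months=1) shifts by one calendar month
ms = pd.DateOffset(months=1)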
e1 = event.Date - ms
e2 = event.Date + ms
iix = pd.IntervalIndex.from_arrays(e1, e2, closed='both')
df.loc[[any(d in i for i in iix) for d in df.Date]]

Out[93]:
   Product       Date
2   Cereal 2008-04-03
3     Coke 2008-04-05
4     Rice 2008-07-22
5     Beef 2008-08-15
6     Pork 2008-08-22
7  Chicken 2008-09-08

Answer 4

This may be useful if only the month matters, regardless of the day.

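rng = []
# a and b are the months just before and just after each event month
for a, b in zip(event['Date'].dt.month - 1, event['Date'].dt.month + 1):
    # range(a, b + 1) covers exactly the months a..b (the original a-1 reached one month too far back)
    rng = rng + list(range(a, b + 1))
# Note: month numbers do not wrap around the year boundary here
df[df['Date'].dt.month.isin(set(rng))]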
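
The month-number approach above ignores the year, so a January event and a December observation are not treated as neighbours. A hedged sketch of one way around that, counting months as year * 12 + month, follows. The helper within_one_month is a hypothetical name introduced here for illustration only.

```python
import numpy as np
import pandas as pd

def within_one_month(df, event):
    """Keep rows of df whose calendar month is at most one month from any event month."""
    # year * 12 + month is a running month count, so Dec 2008 and Jan 2009 differ by 1
    obs = (df['Date'].dt.year * 12 + df['Date'].dt.month).to_numpy()
    evt = (event['Date'].dt.year * 12 + event['Date'].dt.month).to_numpy()
    # Compare every observation month with every event month via broadcasting
    keep = (np.abs(obs[:, None] - evt) <= 1).any(axis=1)
    return df.loc[keep]

df = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cereal', 'Coke', 'Rice', 'Beef', 'Pork', 'Chicken'],
    'Date': pd.to_datetime(['2008-04-02', '2008-04-01', '2008-04-03', '2008-04-05',
                            '2008-07-22', '2008-08-15', '2008-08-22', '2008-09-08']),
})
event = pd.DataFrame({'Date': pd.to_datetime(['2008-05-03', '2008-07-20', '2008-09-01'])})

result = within_one_month(df, event)
print(result)
```

Because this compares whole calendar months, it is looser than a 30-day window: on the sample data every row is kept, including Apple and Banana.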

Filter data whose dates are within +/- 30 days of multiple given dates

4 answers: