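I am trying to reorganize my dataset into monthly periods so that I can run a forecast later. The problem is that I have grouped it by calendar month (January, February, etc.), but what I actually want is to group it into 30-day periods counting back from the current date. Finally, I would like my code to keep only the 5 most recent 30-day periods.
My dataset looks like this:
import pandas as pd
data1 = pd.DataFrame({'Id' : ['001','001','001','001','001','001','001','001','001',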
'002','002','002','002','002','002','002','002','002',],
'Date': ['2020-01-12', '2019-12-30', '2019-12-01','2019-11-01', '2019-08-04', '2019-08-04', '2019-08-01', '2019-07-20', '2019-06-04',
'2020-01-11', '2019-12-12', '2019-12-01','2019-12-01', '2019-09-10', '2019-08-10', '2019-08-01', '2019-06-20', '2019-06-01'],
'Quantity' :[18,5,6,8,12,14,16,19,20, 21,7,6,5,4,3,2,1,0]
})
My code is as follows:
data1['Date'] =pd.to_datetime(data1['Date'])
data1 = data1.groupby('Id').apply(lambda x: x.set_index('Date').resample('M').sum())
data1 = data1.groupby(level='Id').tail(5)
The expected output would look something like this (grouped by Id):
Id Date Quantity
0 001 2020-02-04 18
1 001 2020-01-05 5
2 001 2019-12-06 6
3 001 2019-11-07 8
4 001 2019-11-08 12
5 002 2020-02-04 21
6 002 2020-01-05 7
7 002 2019-12-06 11
8 002 2019-11-07 0
9 002 2019-11-08 3
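As it stands, the calendar-month grouping does not make practical sense: if I want to forecast demand for next month (say March), the calendar-month buckets treat it as two months from today, even though March is only one month away.
I hope my question is clear. I have spent a lot of time trying to figure this out and could use some help. If anyone has a hint, I would be very grateful!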
Answer 0: (score: 1)
You can use pd.cut to group the data into 30-day periods counting back from today.
import pandas as pd
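# Assumes data1['Date'] has already been converted with pd.to_datetime
today = pd.to_datetime('today').normalize() # Midnight of the current day
freq = '30D' # Size of the bins
# Number of bins needed to reach back to the earliest date
Nbin = (today - data1['Date'].min())//pd.Timedelta(freq) + 1
# Bin edges, oldest first, ending at today
bins = [today - n*pd.Timedelta(freq) for n in range(Nbin, -1, -1)]
# Sum Quantity per Id and 30-day bin; selecting the column keeps the datetime column out of the sum
data1.groupby(['Id', pd.cut(data1['Date'], bins=bins)])['Quantity'].sum()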
Id Date
001 (2019-06-09, 2019-07-09] NaN
(2019-07-09, 2019-08-08] 61.0
(2019-08-08, 2019-09-07] NaN
(2019-09-07, 2019-10-07] NaN
(2019-10-07, 2019-11-06] 8.0
(2019-11-06, 2019-12-06] 6.0
(2019-12-06, 2020-01-05] 5.0
(2020-01-05, 2020-02-04] 18.0
002 (2019-06-09, 2019-07-09] 1.0
(2019-07-09, 2019-08-08] 2.0
(2019-08-08, 2019-09-07] 3.0
(2019-09-07, 2019-10-07] 4.0
(2019-10-07, 2019-11-06] NaN
(2019-11-06, 2019-12-06] 11.0
(2019-12-06, 2020-01-05] 7.0
(2020-01-05, 2020-02-04] 21.0
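The question also asks to keep only the five most recent 30-day windows. Here is a minimal sketch of one way to do that with the same pd.cut idea, by building only the six edges that delimit those five windows. The reference date is pinned to 2020-02-04 (the date implied by the bin edges above) so the result is reproducible; `ref_date`, `n_windows`, and `recent` are illustrative names, not part of the original answer.

```python
import pandas as pd

data1 = pd.DataFrame({
    'Id': ['001'] * 9 + ['002'] * 9,
    'Date': ['2020-01-12', '2019-12-30', '2019-12-01', '2019-11-01', '2019-08-04',
             '2019-08-04', '2019-08-01', '2019-07-20', '2019-06-04',
             '2020-01-11', '2019-12-12', '2019-12-01', '2019-12-01', '2019-09-10',
             '2019-08-10', '2019-08-01', '2019-06-20', '2019-06-01'],
    'Quantity': [18, 5, 6, 8, 12, 14, 16, 19, 20, 21, 7, 6, 5, 4, 3, 2, 1, 0],
})
data1['Date'] = pd.to_datetime(data1['Date'])

ref_date = pd.Timestamp('2020-02-04')  # assumption: stands in for "today"
width = pd.Timedelta('30D')
n_windows = 5                          # assumption: number of windows to keep

# n_windows + 1 edges, oldest first, ending at ref_date
bins = [ref_date - n * width for n in range(n_windows, -1, -1)]

# Dates older than the first edge fall outside every bin and are dropped
recent = (
    data1.groupby(['Id', pd.cut(data1['Date'], bins=bins)], observed=False)['Quantity']
    .sum()
)
print(recent)
```

Because the bins are right-closed, a record dated exactly on `ref_date` lands in the newest window, while a record dated exactly 150 days earlier is excluded.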
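Answer 1: (score: 1)
You can use pandas.Series.dt.days to convert each date into a number of days relative to a reference date ("today"), then integer-divide by 30: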
import pandas as pd
# Fixed reference date used for this example
today = pd.to_datetime('2019-05-13')
data1 = pd.DataFrame({'Id' : ['001','001','001','001','001','001','001','001','001',
'002','002','002','002','002','002','002','002','002',],
'Date': ['2020-01-12', '2019-12-30', '2019-12-01','2019-11-01', '2019-08-04', '2019-08-04', '2019-08-01', '2019-07-20', '2019-06-04',
'2020-01-11', '2019-12-12', '2019-12-01','2019-12-01', '2019-09-10', '2019-08-10', '2019-08-01', '2019-06-20', '2019-06-01'],
'Quantity' :[18,5,6,8,12,14,16,19,20, 21,7,6,5,4,3,2,1,0]
})
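# Number of whole 30-day periods between the reference date and each record
data1['Period from Today'] = (pd.to_datetime(data1['Date'])-today).dt.days // 30
# Group by Id and period; note that this rebinds data1 to a GroupBy object
data1 = data1.groupby(['Id', 'Period from Today'])
for key, group in data1:
    print(group)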
Id Date Quantity Period from Today
8 001 2019-06-04 20 0
Id Date Quantity Period from Today
4 001 2019-08-04 12 2
5 001 2019-08-04 14 2
6 001 2019-08-01 16 2
7 001 2019-07-20 19 2
Id Date Quantity Period from Today
3 001 2019-11-01 8 5
Id Date Quantity Period from Today
2 001 2019-12-01 6 6
Id Date Quantity Period from Today
1 001 2019-12-30 5 7
Id Date Quantity Period from Today
0 001 2020-01-12 18 8
Id Date Quantity Period from Today
17 002 2019-06-01 0 0
Id Date Quantity Period from Today
16 002 2019-06-20 1 1
Id Date Quantity Period from Today
14 002 2019-08-10 3 2
15 002 2019-08-01 2 2
Id Date Quantity Period from Today
13 002 2019-09-10 4 4
Id Date Quantity Period from Today
11 002 2019-12-01 6 6
12 002 2019-12-01 5 6
Id Date Quantity Period from Today
10 002 2019-12-12 7 7
Id Date Quantity Period from Today
9 002 2020-01-11 21 8
It is not clear to me exactly how you want the data organized, but I hope this helps.
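If the goal is one summed Quantity per Id for each of the five most recent 30-day periods, the same integer-division idea can be turned around to count backward from the reference date. The following is only a sketch under assumptions: the reference date is fixed at 2020-02-04 for reproducibility, and the names `ref_date`, `Periods Ago`, `Window End`, and `last5` are made up for illustration.

```python
import pandas as pd

data1 = pd.DataFrame({
    'Id': ['001'] * 9 + ['002'] * 9,
    'Date': ['2020-01-12', '2019-12-30', '2019-12-01', '2019-11-01', '2019-08-04',
             '2019-08-04', '2019-08-01', '2019-07-20', '2019-06-04',
             '2020-01-11', '2019-12-12', '2019-12-01', '2019-12-01', '2019-09-10',
             '2019-08-10', '2019-08-01', '2019-06-20', '2019-06-01'],
    'Quantity': [18, 5, 6, 8, 12, 14, 16, 19, 20, 21, 7, 6, 5, 4, 3, 2, 1, 0],
})
data1['Date'] = pd.to_datetime(data1['Date'])

ref_date = pd.Timestamp('2020-02-04')  # assumption: stands in for "today"

# 0 = the most recent 30 days, 1 = the 30 days before that, and so on
data1['Periods Ago'] = (ref_date - data1['Date']).dt.days // 30

# Keep the five most recent periods and sum Quantity per Id and period
last5 = (
    data1[data1['Periods Ago'].between(0, 4)]
    .groupby(['Id', 'Periods Ago'], as_index=False)['Quantity']
    .sum()
)

# Label each period by the date on which it ends
last5['Window End'] = ref_date - pd.to_timedelta(last5['Periods Ago'] * 30, unit='D')
print(last5)
```

Periods with no records simply do not appear in `last5`; if every Id needs a row for every period (for example with a Quantity of 0), the result would have to be reindexed afterwards.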