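This is a personal project involving linear regression, and I'm building the datasets to feed into the regression algorithm.
The data I'm working with looks like the following minimal example: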
         date  avgwindsp  precip  temp_max  temp_min
0  2006-01-01        3.1    16.5      11.7       6.1
1  2006-01-02        4.9     2.0      18.3      10.0
2  2006-01-03        1.7     0.0      15.6       5.0
3  2006-01-04        1.6     0.0      15.6       5.0
4  2006-01-05        1.7     0.0      20.6       4.4
5  2006-01-06        1.4     0.0      17.8       5.6
6  2006-01-07        1.7     0.0      15.6       5.0
It just continues like this for years, up to the current date.
I need to collect this data into yearly averages using four different methods:
Method 1 (PYD):
4/20/17-4/20/18
4/20/18-4/20/19
4/20/19-4/20/20
Start point = Today
Period = # of days in year
Overlap = 0
Offset from start of year = Days to today
Method 2 (FLH):
11/01/17-4/20/18
11/01/18-4/20/19
11/01/19-4/20/20
Start point = Specific Date
Period = # of days in year
Overlap = Days between specific Date and current date
Offset from start of year = Days between Specific Date and Start of Year
Method 3 (BCY):
1/1/18-4/20/18
1/1/19-4/20/19
1/1/20-4/20/20
Start point = Start of Year
Period = # of days in year
Overlap = Start of year - current date
Offset = None
Method 4 (LCY):
1/1/17-4/20/18
1/1/18-4/20/19
1/1/19-4/20/20
Start point = Today
Period = # of days in year
Overlap = Negative (Start of year - current date)
Offset = Days to today
My first attempt was to use resample to split the data by year, but yearly resampling uses fixed bins (which I did get working) and doesn't support overlap.
import numpy as np
import pandas as pd

aggregate_methods = {
    'temp_max': np.mean,
    'temp_min': np.mean,
    'precip': np.sum,
}
climate_data['date'] = pd.to_datetime(climate_data['date'], format='%Y-%m-%d')
# Get first data year
first_date = climate_data.iloc[0]['date']
last_date = climate_data.iloc[-1]['date']
# harvest_month is assumed to be defined earlier (e.g. 11 for November)
first_harvest = first_date.replace(month=harvest_month, day=1)
# Calculate offsets. https://www.w3schools.com/python/python_datetime.asp
pyd_alias = 'A-' + last_date.strftime('%b').upper()  # Short version of current month
flh_alias = 'AS-' + first_harvest.strftime('%b').upper()  # Short version of harvest month
bcy_alias = 'AS'  # no changes needed
lcy_alias = 'AS'  # no changes needed
# Resample Data
pyd_climate_data = climate_data.resample(pyd_alias,
                                         on='date',
                                         ).agg(aggregate_methods).reset_index()
flh_climate_data = climate_data.resample(flh_alias,
                                         on='date',
                                         label='right'
                                         ).agg(aggregate_methods).reset_index()
bcy_climate_data = climate_data.resample(bcy_alias,
                                         on='date',
                                         ).agg(aggregate_methods).reset_index()
lcy_climate_data = climate_data.resample(lcy_alias,
                                         on='date',
                                         ).agg(aggregate_methods).reset_index()
# LCY Partial Data Merge
lcy_agg_last_year = lcy_climate_data.iloc[[-2, -1]].agg(aggregate_methods)
lcy_climate_data.iloc[-1, 1:] = lcy_agg_last_year
While searching for a solution I came across this question: Pandas resample with overlap
From it I put together four possible concat expressions, starting from today's day of the year:
new_year_day = pd.Timestamp(year=date.year, month=1, day=1)
specific_date = pd.Timestamp(year=date.year, month=11, day=1)
day_of_the_year = (date - new_year_day).days + 1
# PYD
N = 365  # I'll figure out a way to handle leap years later...
ov = day_of_the_year
pd.concat([df.shift(i).iloc[::N] for i in range(0, N)], axis=1).agg(aggregate_methods).reset_index()
# FLH
ov = day_to_the_date
pd.concat([df.shift(i).iloc[::N] for i in range(0, N+ov)], axis=1).agg(aggregate_methods).reset_index()
# BCY
ov = day_of_the_year
pd.concat([df.shift(i).iloc[::N] for i in range(0, N+ov)], axis=1).agg(aggregate_methods).reset_index()
# LCY
ov = day_of_the_year
pd.concat([df.shift(i).iloc[::N] for i in range(-ov, N, -1)], axis=1).agg(aggregate_methods).reset_index()
This fails because it throws an error.
I really don't understand why the slicing doesn't work. In any case, I also can't see a way to offset the start/end of the range.
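To make the shape of that concat result concrete, here is a small sketch on a toy frame. The toy data and N = 5 (standing in for 365) are assumptions for illustration only. It shows that the concat puts N shifted copies side by side, so every column name appears N times, which may be why a dict keyed by column name no longer maps cleanly onto the result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 10 daily rows with the same column names as the climate data.
dates = pd.date_range("2006-01-01", periods=10, freq="D")
df = pd.DataFrame(
    {"temp_max": np.arange(10.0), "temp_min": np.arange(10.0), "precip": np.ones(10)},
    index=dates,
)

N = 5  # stand-in for 365, to keep the example small

# Same pattern as the attempts above: N shifted copies, each keeping every N-th row.
wide = pd.concat([df.shift(i).iloc[::N] for i in range(N)], axis=1)

print(wide.shape)              # rows = every N-th day, columns = N copies of each column
print(wide.columns.is_unique)  # column names are repeated once per shifted copy
print(wide["temp_max"].shape)  # selecting one name returns a frame, not a single column
```

Each row of `wide` holds one window laid out horizontally, so aggregating it would have to run across columns sharing a name rather than down a single column.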
I looked into pandas.DataFrame.rolling and tried to follow a tutorial on it, but that window is continuous and its frequency is daily.
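For comparison, the four window definitions listed earlier can be written as explicit start/end dates and aggregated with plain boolean masks, sidestepping resample and rolling entirely. This is only a sketch under assumptions: `window_start` and `aggregate_windows` are hypothetical helper names, the toy data is made up, the harvest month is assumed to be November (from the 11/01 example), the FLH start is assumed to fall in the previous calendar year, and the string aggregators "mean" and "sum" stand in for np.mean and np.sum.

```python
import pandas as pd


def window_start(end, method, harvest_month=11):
    """Start date of the window ending on `end`, following the four listed methods."""
    if method == "PYD":  # same calendar day, one year earlier
        return end - pd.DateOffset(years=1)
    if method == "FLH":  # first day of the harvest month in the previous year (assumed)
        return pd.Timestamp(year=end.year - 1, month=harvest_month, day=1)
    if method == "BCY":  # start of the same year
        return pd.Timestamp(year=end.year, month=1, day=1)
    if method == "LCY":  # start of the previous year
        return pd.Timestamp(year=end.year - 1, month=1, day=1)
    raise ValueError(f"unknown method: {method}")


def aggregate_windows(df, method, aggregate_methods, harvest_month=11):
    """Aggregate one window per year, each ending on the calendar day of the last row."""
    data_start, last = df["date"].min(), df["date"].max()
    rows = []
    for year in range(data_start.year, last.year + 1):
        end = last - pd.DateOffset(years=last.year - year)
        start = window_start(end, method, harvest_month)
        if start < data_start:
            continue  # skip windows the data does not fully cover
        # Both endpoints are included, so adjacent PYD windows share their boundary day.
        in_window = (df["date"] >= start) & (df["date"] <= end)
        stats = df.loc[in_window].agg(aggregate_methods).to_dict()
        rows.append({"start": start, "end": end, **stats})
    return pd.DataFrame(rows)


# Hypothetical toy data: constant values make the window lengths easy to check.
climate = pd.DataFrame({"date": pd.date_range("2017-01-01", "2020-04-20", freq="D")})
climate["temp_max"] = 20.0
climate["temp_min"] = 5.0
climate["precip"] = 1.0  # one unit per day, so the sum equals the window length in days

aggs = {"temp_max": "mean", "temp_min": "mean", "precip": "sum"}
for method in ("PYD", "FLH", "BCY", "LCY"):
    print(method)
    print(aggregate_windows(climate, method, aggs))
```

Because each window is an independent mask, overlap (LCY) and partial years (BCY, FLH) need no special handling, at the cost of one pass over the data per window.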