Pandas: resample daily data to annual data with overlap and offset?

Time: 2020-09-03 17:53:03

Tags: python pandas linear-regression

Problem description

This is a personal project involving linear regression, and I am building the datasets to feed into the regression algorithm.

The data I am working with looks like the following minimal example:

         date  avgwindsp  precip  temp_max  temp_min
0  2006-01-01        3.1    16.5      11.7       6.1
1  2006-01-02        4.9     2.0      18.3      10.0
2  2006-01-03        1.7     0.0      15.6       5.0
3  2006-01-04        1.6     0.0      15.6       5.0
4  2006-01-05        1.7     0.0      20.6       4.4
5  2006-01-06        1.4     0.0      17.8       5.6
6  2006-01-07        1.7     0.0      15.6       5.0

except that it continues for years, up to the current date.

I need to aggregate this data into annual averages using four different methods:

Method 1 (PYD):

    4/20/17-4/20/18
    4/20/18-4/20/19
    4/20/19-4/20/20

Start point = Today
Period = # of days in year
Overlap = 0
Offset from start of year = Days to today

Method 2 (FLH):

    11/01/17-4/20/18
    11/01/18-4/20/19
    11/01/19-4/20/20

Start point = Specific Date
Period = # of days in year
Overlap = Days between specific Date and current date
Offset from start of year = Days between Specific Date and Start of Year

Method 3 (BCY):

    1/1/18-4/20/18
    1/1/19-4/20/19
    1/1/20-4/20/20

Start point = Start of Year
Period = # of days in year
Overlap = Start of year - current date
Offset = None

Method 4 (LCY):

    1/1/17-4/20/18
    1/1/18-4/20/19
    1/1/19-4/20/20

Start point = Today
Period = # of days in year
Overlap = Negative (Start of year - current date)
Offset = Days to today

What I have tried

Resample

My first attempt was to use resample to split the data into annual bins, but annual resampling uses fixed bin edges (which I managed to work with) and does not support overlap.

import numpy as np
import pandas as pd

# climate_data is the daily DataFrame shown above; harvest_month is the
# month number of the harvest (11 in the FLH example).
aggregate_methods = {
    'temp_max': np.mean,
    'temp_min': np.mean,
    'precip': np.sum,
}

climate_data['date'] = pd.to_datetime(climate_data['date'], format='%Y-%m-%d')

# Get first data year (iloc[0] is the first row; iloc[1] would skip it)
first_date = climate_data.iloc[0]['date']
last_date = climate_data.iloc[-1]['date']
first_harvest = first_date.replace(month=harvest_month, day=1)

# Calculate offsets. https://www.w3schools.com/python/python_datetime.asp
# Note: pandas 2.2+ deprecates the 'A'/'AS' aliases in favor of 'YE'/'YS'.
pyd_alias = 'A-' + last_date.strftime('%b').upper()  # Short version of current month
flh_alias = 'AS-' + first_harvest.strftime('%b').upper()  # Short version of harvest month
bcy_alias = 'AS'  # no changes needed
lcy_alias = 'AS'  # no changes needed

# Resample Data
pyd_climate_data = climate_data.resample(pyd_alias,
                                         on='date',
                                         ).agg(aggregate_methods).reset_index()
flh_climate_data = climate_data.resample(flh_alias,
                                         on='date',
                                         label='right'
                                         ).agg(aggregate_methods).reset_index()
bcy_climate_data = climate_data.resample(bcy_alias,
                                         on='date',
                                         ).agg(aggregate_methods).reset_index()
lcy_climate_data = climate_data.resample(lcy_alias,
                                         on='date',
                                         ).agg(aggregate_methods).reset_index()

# LCY Partial Data Merge
lcy_agg_last_year = lcy_climate_data.iloc[[-2, -1]].agg(aggregate_methods)
lcy_climate_data.iloc[-1, 1:] = lcy_agg_last_year

Concat and shift

While searching for a solution I came across this question: Pandas resample with overlap

Based on it, I produced four candidate concat expressions:

Days to date:

# `date` is the current date and `df` is the daily DataFrame
new_year_day = pd.Timestamp(year=date.year, month=1, day=1)
specific_date = pd.Timestamp(year=date.year, month=11, day=1)
day_of_the_year = (date - new_year_day).days + 1


# PYD
N = 365 # I'll figure out a way to handle leap years later...
ov = day_of_the_year
pd.concat([df.shift(i).iloc[::N] for i in range(0, N)], axis=1).agg(aggregate_methods).reset_index()

# FLH
ov = day_to_the_date
pd.concat([df.shift(i).iloc[::N] for i in range(0, N+ov)], axis=1).agg(aggregate_methods).reset_index()

# BCY
ov = day_of_the_year
pd.concat([df.shift(i).iloc[::N] for i in range(0, N+ov)], axis=1).agg(aggregate_methods).reset_index()

# LCY
ov = day_of_the_year
pd.concat([df.shift(i).iloc[::N] for i in range(-ov, N, -1)], axis=1).agg(aggregate_methods).reset_index()

This fails because it throws an error.

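I really don't understand why the slicing doesn't work. In any case, I also don't see how to offset the values at which each range starts and ends.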
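One way to control where each range starts and ends is to skip fixed-frequency binning and build the date windows explicitly, then aggregate the rows that fall inside each window. The following is only a minimal sketch under assumptions: the synthetic `climate_data`, the fixed `today`, `harvest_month = 11`, and the helper `aggregate_windows` are all made up for illustration, and the windows are inclusive on both ends, exactly as the ranges are written in the question (so a boundary day such as 4/20 is counted in two adjacent PYD windows).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily data (values are random, not real weather).
dates = pd.date_range("2016-01-01", "2020-04-20", freq="D")
rng = np.random.default_rng(0)
climate_data = pd.DataFrame({
    "date": dates,
    "precip": rng.uniform(0, 5, len(dates)),
    "temp_max": rng.uniform(10, 25, len(dates)),
    "temp_min": rng.uniform(0, 10, len(dates)),
})

aggregate_methods = {"temp_max": "mean", "temp_min": "mean", "precip": "sum"}


def aggregate_windows(df, windows, how):
    """Aggregate df over explicit (start, end) date windows, both ends inclusive."""
    rows = []
    for start, end in windows:
        in_window = df["date"].between(start, end)
        row = df.loc[in_window].agg(how).to_dict()
        row.update(start=start, end=end)
        rows.append(row)
    return pd.DataFrame(rows)


today = pd.Timestamp("2020-04-20")  # assumed "current date"
harvest_month = 11                  # assumed harvest month (FLH)

# Window ends: today's date in each of the last three years, oldest first.
ends = [today - pd.DateOffset(years=k) for k in range(3)][::-1]

# Each method differs only in how the window start is derived from its end.
pyd_windows = [(e - pd.DateOffset(years=1), e) for e in ends]
flh_windows = [(pd.Timestamp(year=e.year - 1, month=harvest_month, day=1), e) for e in ends]
bcy_windows = [(pd.Timestamp(year=e.year, month=1, day=1), e) for e in ends]
lcy_windows = [(pd.Timestamp(year=e.year - 1, month=1, day=1), e) for e in ends]

pyd = aggregate_windows(climate_data, pyd_windows, aggregate_methods)
flh = aggregate_windows(climate_data, flh_windows, aggregate_methods)
bcy = aggregate_windows(climate_data, bcy_windows, aggregate_methods)
lcy = aggregate_windows(climate_data, lcy_windows, aggregate_methods)
```

Because the windows are plain (start, end) pairs, overlap and offset stop being parameters of the resampler and become part of how the pairs are generated. Leap years are handled by `pd.DateOffset(years=1)` rather than a fixed `N = 365`.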

Rolling window

I looked into pandas.DataFrame.rolling and tried to follow a tutorial on it, but the window is continuous and advances at a daily frequency.
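A continuous daily window can still yield one value per year if only the rows on the anniversary date are kept afterwards. The sketch below illustrates this for a PYD-style trailing mean; the synthetic `daily` frame and the names `trailing` and `pyd_temp_max` are assumptions for illustration only. Note that `"365D"` is a fixed 365-day span, so it is not exactly a calendar year when a leap day falls inside the window.

```python
import numpy as np
import pandas as pd

# Synthetic daily series on a DatetimeIndex (values are just 0, 1, 2, ...).
dates = pd.date_range("2016-01-01", "2020-04-20", freq="D")
daily = pd.DataFrame({"temp_max": np.arange(len(dates), dtype=float)}, index=dates)

today = dates[-1]  # assumed "current date"

# Trailing mean over the window (t - 365 days, t], computed for every day.
# min_periods=365 leaves NaN wherever a full window is not yet available.
trailing = daily["temp_max"].rolling("365D", min_periods=365).mean()

# Keep one value per year: the rows that fall on today's month and day.
on_anniversary = (trailing.index.month == today.month) & (trailing.index.day == today.day)
pyd_temp_max = trailing[on_anniversary].dropna()
```

This computes far more windows than are needed, so it is mainly useful when the window length is fixed; the FLH, BCY and LCY ranges have start dates tied to the calendar and do not fit a single fixed-length window.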

0 answers:

No answers yet