Question

In Python 3.5 with pandas 0.20, say I have a periodic time series covering one year:

import pandas as pd
import numpy as np

start_date = pd.to_datetime("2015-01-01T01:00:00.000Z", infer_datetime_format=True)
end_date = pd.to_datetime("2015-12-31T23:00:00.000Z", infer_datetime_format=True)
index = pd.date_range(start=start_date,
                      end=end_date,
                      freq="60min")
time = np.array((index - start_date)/ np.timedelta64(1, 'h'), dtype=int)
df = pd.DataFrame(index=index)
df["foo"] = np.sin( 2 * np.pi * time / len(time))

df.plot()

I would like to extrapolate the time series periodically onto a new index, i.e. with:

new_start_date = pd.to_datetime("2017-01-01T01:00:00.000Z", infer_datetime_format=True)
new_end_date = pd.to_datetime("2019-12-31T23:00:00.000Z", infer_datetime_format=True)
new_index = pd.date_range(start=new_start_date,
                          end=new_end_date,
                          freq="60min")

I would like to use some kind of extrapolate_periodic method to get:

# DO NOT RUN
new_df = df.extrapolate_periodic(index=new_index)
# END DO NOT RUN

new_df.plot()

What is the best way to do this in pandas?

How can I define the periodicity and easily get data for the new index?

Answer 1

I think I have what you are looking for, although it is not a single built-in pandas method.

Carrying on directly from where you left off:

def extrapolate_periodic(df, new_index):
    # Average the known values for each (day of year, hour) pair.
    df_right = df.groupby([df.index.dayofyear, df.index.hour]).mean()
    # Empty frame on the new index, carrying the same two lookup keys.
    df_left = pd.DataFrame({'new_index': new_index}).set_index('new_index')
    df_left = df_left.assign(dayofyear=lambda x: x.index.dayofyear,
                             hour=lambda x: x.index.hour)
    # Look up each new timestamp's (dayofyear, hour) in the averaged data.
    df = (pd.merge(df_left, df_right, left_on=['dayofyear', 'hour'],
                   right_index=True, suffixes=('', '_y'))
            .drop(['dayofyear', 'hour'], axis=1))
    return df.sort_index()

new_df = extrapolate_periodic(df, new_index)
# or, in method-chaining style:
# new_df = df.pipe(extrapolate_periodic, new_index)

new_df.plot()

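If you have more than one year of data, this takes the mean over each repeated day-of-year and hour. If you only want the most recent reading, mean can be changed to last.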
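To illustrate the difference between mean and last, here is a minimal sketch on made-up data. The frame two_years and its values are assumptions for the example only: two full non-leap years of hourly data whose values simply count the hours, so each (dayofyear, hour) pair occurs exactly twice.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two non-leap years, hourly, value = hours since the start.
idx = pd.date_range("2014-01-01 00:00", "2015-12-31 23:00", freq="60min")
two_years = pd.DataFrame({"foo": np.arange(len(idx), dtype=float)}, index=idx)

keys = [two_years.index.dayofyear, two_years.index.hour]
by_mean = two_years.groupby(keys).mean()  # average over both years
by_last = two_years.groupby(keys).last()  # keep only the most recent year

# For 1 January 00:00 the 2014 value is 0 and the 2015 value is 8760.
print(by_mean.loc[(1, 0), "foo"])  # mean of 0 and 8760 -> 4380.0
print(by_last.loc[(1, 0), "foo"])  # most recent reading -> 8760.0
```

With mean, an outlier in one year is smoothed by the others; with last, the extrapolation simply replays the latest year.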

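This will not work as-is if you have less than a full year of data, because new timestamps with no matching day-of-year and hour are dropped by the merge. You can work around that by adding a reindex to complete the year and then using interpolate with a polynomial method to fill in the missing values in the foo column.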
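The reindex-then-interpolate idea might look roughly like the sketch below. It is only an illustration under stated assumptions: the data (a sine over 2015 with one week in June removed) is made up, and it uses time-based linear interpolation rather than a polynomial so that it runs without SciPy. Passing method="polynomial" with an order to interpolate is the polynomial alternative, which requires SciPy.

```python
import numpy as np
import pandas as pd

# Hypothetical incomplete year: hourly sine with one week missing in June.
full_year = pd.date_range("2015-01-01 00:00", "2015-12-31 23:00", freq="60min")
hours = np.arange(len(full_year))
complete = pd.DataFrame({"foo": np.sin(2 * np.pi * hours / len(hours))},
                        index=full_year)
gap = ((complete.index >= pd.Timestamp("2015-06-01")) &
       (complete.index < pd.Timestamp("2015-06-08")))
partial = complete[~gap]

# Reindex onto the full year (the gap becomes NaN), then fill the gap.
# Swapped-in technique: linear interpolation in time, not polynomial.
filled = partial.reindex(full_year).interpolate(method="time")

print(int(filled["foo"].isna().sum()))  # number of values still missing
```

Note that interpolation only fills gaps between known values; missing data at the very start of the year would need a different treatment. The filled frame can then be passed to extrapolate_periodic in place of df.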

Answer 2

Here is some code I used to solve the problem. It assumes that the initial series corresponds to exactly one period of data.

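def extrapolate_periodic(df, new_index):
    index = df.index
    start_date = np.min(index)
    end_date = np.max(index)
    # One period is taken to be the span of the original index, in whole hours.
    period = np.array((end_date - start_date) / np.timedelta64(1, 'h'), dtype=int)
    # Hours elapsed between the original start and each new timestamp.
    time = np.array((new_index - start_date)/ np.timedelta64(1, 'h'), dtype=int)
    new_df = pd.DataFrame(index=new_index)
    for col in df.columns:
        # Wrap the elapsed hours into one period and look up by position
        # (assumes hourly sampling, so hour offsets equal row positions).
        new_df[col] = np.array(df[col].iloc[time % period])
    return new_df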
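The modular-indexing idea is easy to check by hand on a tiny series. The sketch below is a variation, not the code above: tile_periodic is a hypothetical helper that takes one period to be the number of samples, len(df), rather than the span between the first and last timestamps, and it assumes both indexes lie on the same regular grid (hourly by default).

```python
import numpy as np
import pandas as pd

def tile_periodic(df, new_index, step=np.timedelta64(1, "h")):
    """Hypothetical helper: repeat df cyclically, one period = len(df) samples."""
    # Whole steps between the first original timestamp and each new one.
    offsets = np.asarray((new_index - df.index[0]) / step).astype(int)
    # Wrap into [0, len(df)); NumPy's % is non-negative for a positive modulus.
    positions = offsets % len(df)
    return pd.DataFrame(df.to_numpy()[positions],
                        index=new_index, columns=df.columns)

base = pd.DataFrame({"foo": [0.0, 1.0, 2.0, 3.0]},
                    index=pd.date_range("2015-01-01 00:00", periods=4, freq="60min"))
later = pd.date_range("2015-01-02 01:00", periods=6, freq="60min")

# later starts 25 hours after base, and 25 % 4 == 1, so the cycle resumes at 1.0.
print(tile_periodic(base, later)["foo"].tolist())
# [1.0, 2.0, 3.0, 0.0, 1.0, 2.0]
```

Which definition of the period is right depends on the data: if the last sample is meant to coincide with the first sample of the next cycle, the span-based period fits; if every sample is distinct within the cycle, the sample count does.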
