Question

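Suppose I want to reindex a time series onto a predefined index using linear interpolation, where the old and new indexes share no index values. For example: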

# index is all precise timestamps e.g. 2018-10-08 05:23:07
series = pandas.Series(data,index) 

# I want rounded date-times
desired_index = pandas.date_range("2018-10-08",periods=10,freq="30min")

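The tutorials/API suggest doing this by calling reindex first and then filling the NaN values with interpolate. However, since no datetimes overlap between the old and new indexes, reindex outputs all NaN: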

# The following outputs all NaN as no date times match old to new index
series.reindex(desired_index)

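I don't want to fill with the nearest value during reindex, because that would lose precision, so I came up with the following: concatenate the reindexed series with the original series before interpolating: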

pandas.concat([series,series.reindex(desired_index)]).sort_index().interpolate(method="linear")

Concatenating the two series and then sorting seems very inefficient. Is there a better way?

Answer 1

The only (simple) way I can see to do this is to use resample to upsample to your time resolution (say, 1 second) and then reindex.
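As a rough sketch of that resample idea (the three-point series below is made up for illustration, and it assumes every timestamp falls on a whole second, so the 1-second grid hits each sample):

```python
import numpy as np
import pandas as pd

# Hypothetical data: timestamps never coincide with the 30-minute target grid.
series = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(
        ["2018-10-01 00:00:03", "2018-10-01 00:30:03", "2018-10-01 01:00:01"]
    ),
)
desired_index = pd.date_range("2018-10-01", periods=3, freq="30min")

upsampled = (
    series
    .resample("1s").mean()        # 1-second grid; empty bins become NaN
    .interpolate(method="time")   # fill the gaps between real samples
    .reindex(desired_index)       # keep only the rounded timestamps
)
print(upsampled)
```

This materialises one row per second, so it can get large for long series. The first target timestamp lies before the first sample and therefore stays NaN.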

Get an example DataFrame:

import numpy as np
import pandas as pd

np.random.seed(2)

df = (pd.DataFrame()
 .assign(SampleTime=pd.date_range(start='2018-10-01', end='2018-10-08', freq='30min')
                    + pd.to_timedelta(np.random.randint(-5, 5, size=337), unit='s'),
         Value=np.random.randn(337)
         )
 .set_index(['SampleTime'])
)

Let's see what the data looks like:

df.head()

                        Value
SampleTime
2018-10-01 00:00:03     0.033171
2018-10-01 00:30:03     0.481966
2018-10-01 01:00:01     -0.495496

Get the desired index:

desired_index = pd.date_range('2018-10-01', periods=10, freq='30min')

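Now reindex the data onto the union of the desired and existing indexes, interpolate based on time (method='time' weights by the actual gap between timestamps, whereas method='linear' treats the rows as equally spaced), and then reindex again using only the desired index: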

(df
 .reindex(df.index.union(desired_index))
 .interpolate(method='time')
 .reindex(desired_index)
)

                        Value
2018-10-01 00:00:00     NaN
2018-10-01 00:30:00     0.481218
2018-10-01 01:00:00     -0.494952
2018-10-01 01:30:00     -0.103270

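As you can see, there is still a problem with the first timestamp, because it lies outside the range of the original index. There are a number of ways to deal with this (pad, for example).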
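One possible way to handle that first timestamp, sketched on a small hand-made series (the values and the name `filled` are made up for illustration), is to back-fill before the final reindex, so the leading point takes the first real observation instead of staying NaN. Whether a flat extrapolation like this is acceptable depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first sample arrives 3 seconds after the first target time.
series = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(
        ["2018-10-01 00:00:03", "2018-10-01 00:30:03", "2018-10-01 01:00:01"]
    ),
)
desired_index = pd.date_range("2018-10-01", periods=3, freq="30min")

filled = (
    series
    .reindex(series.index.union(desired_index))
    .interpolate(method="time")   # time-weighted interpolation between samples
    .bfill()                      # leading gap: copy the first real sample backwards
    .reindex(desired_index)
)
print(filled)
```

Back-filling before the final reindex means the value used at 00:00:00 is the real sample from 00:00:03, not the already-interpolated value at 00:30:00.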

Answer 2

My approach

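    # nyse_trading_dates and prices are my own DataFrames (not shown here)
    frequency = nyse_trading_dates.rename_axis([None]).index

    df = prices.rename_axis([None]).reindex(frequency)

    # put the original observations back, since reindex dropped them
    for d in prices.rename_axis([None]).index:
        df.loc[d] = prices.loc[d]

    # sort so the re-added rows sit in time order, and keep the result
    df = df.sort_index().interpolate(method='linear')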

Approach 2

    # keep only the last row for each duplicated timestamp
    prices = data.loc[~data.index.duplicated(keep='last')]
    #prices = data.reset_index()

    idx1 = prices.index
    idx1 = pd.to_datetime(idx1, errors='coerce')

    # idx2 is the target index; it is not defined in this snippet
    merged = idx1.union(idx2)
    s = prices.reindex(merged)
    df = s.interpolate(method='linear').dropna(axis=0, how='any')

    data = df

Pandas: reindex and interpolate a time series efficiently (reindex drops data)
