Question

I am trying to distribute the sum of a time period evenly among its components when upsampling to a finer time period.

What I did:

>>> rng = pandas.PeriodIndex(start='2014-01-01', periods=2, freq='W')
>>> ts = pandas.Series([i+1 for i in range(len(rng))], index=rng)
>>> ts
2013-12-30/2014-01-05    1
2014-01-06/2014-01-12    2
Freq: W-SUN, dtype: float64

>>> ts.resample('D')
2013-12-30     1
2013-12-31   NaN
2014-01-01   NaN
2014-01-02   NaN
2014-01-03   NaN
2014-01-04   NaN
2014-01-05   NaN
2014-01-06     2
2014-01-07   NaN
2014-01-08   NaN
2014-01-09   NaN
2014-01-10   NaN
2014-01-11   NaN
2014-01-12   NaN
Freq: D, dtype: float64

What I actually want is:

>>> ts.resample('D', some_miracle_thing)
2013-12-30     1/7
2013-12-31     1/7
2014-01-01     1/7
2014-01-02     1/7
2014-01-03     1/7
2014-01-04     1/7
2014-01-05     1/7
2014-01-06     2/7
2014-01-07     2/7
2014-01-08     2/7
2014-01-09     2/7
2014-01-10     2/7
2014-01-11     2/7
2014-01-12     2/7
Freq: D, dtype: float64

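Is there a way to do this?

Specifically, for example, with an x/7 lambda function?
And more generally, in a way that is independent of the factor 7 (for example 24 between hours and days, and so on)?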

Answer 1

It is a bit convoluted, but does this help?

First, when resampling, add .groupby(level=0) to keep the original timestamps. (Based on this answer.)

rs = ts.groupby(level=0).resample('D')

Then apply a groupby on the first level of the MultiIndex to perform the desired operation.

In [285]: rs.groupby(level=0).transform(lambda x: x.iloc[0] / float(len(x)))
Out[285]: 
2013-12-30/2014-01-05  2013-12-30    0.142857
                       2013-12-31    0.142857
                       2014-01-01    0.142857
                       2014-01-02    0.142857
                       2014-01-03    0.142857
                       2014-01-04    0.142857
                       2014-01-05    0.142857
2014-01-06/2014-01-12  2014-01-06    0.285714
                       2014-01-07    0.285714
                       2014-01-08    0.285714
                       2014-01-09    0.285714
                       2014-01-10    0.285714
                       2014-01-11    0.285714
                       2014-01-12    0.285714
dtype: float64
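The snippets in this thread were written against an older pandas API (for instance, `PeriodIndex(start=...)` is no longer accepted by current releases). As a minimal sketch of the same idea for a recent pandas, the block below expands each weekly period into its own days and divides the weekly value by however many days that period contains. The names `pieces` and `daily` are illustrative only, and the explicit loop is chosen for clarity rather than speed.

```python
import pandas as pd

# Same two weekly periods as in the question, built with period_range.
rng = pd.period_range(start="2014-01-01", periods=2, freq="W")
ts = pd.Series([1.0, 2.0], index=rng)

pieces = []
for period, value in ts.items():
    # All daily periods contained in this weekly period.
    days = pd.period_range(
        start=period.asfreq("D", how="start"),
        end=period.asfreq("D", how="end"),
        freq="D",
    )
    # Spread the weekly total evenly over those days.
    pieces.append(pd.Series(value / len(days), index=days))

daily = pd.concat(pieces)
print(daily)
```

Because the divisor is `len(days)` rather than a hard-coded 7, the same loop should also work for other frequency pairs, such as months to days.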

Answer 2

This works, but I find it ugly:

>>> rs = ts.resample('D', fill_method="pad")
>>> rs/7

2013-12-30    0.142857
2013-12-31    0.142857
2014-01-01    0.142857
2014-01-02    0.142857
2014-01-03    0.142857
2014-01-04    0.142857
2014-01-05    0.142857
2014-01-06    0.285714
2014-01-07    0.285714
2014-01-08    0.285714
2014-01-09    0.285714
2014-01-10    0.285714
2014-01-11    0.285714
2014-01-12    0.285714
Freq: D, dtype: float64

Isn't there a built-in function for this basic operation?
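One way to avoid hard-coding the 7 is to divide each filled value by the number of fine-grained rows that came from the same original row. The following is a sketch under stated assumptions, not a built-in pandas feature: it assumes monthly totals labelled at month start on a DatetimeIndex, and the names `monthly`, `grid` and `source` are made up for the example.

```python
import pandas as pd

# Monthly totals labelled by month start; the months have 31, 28 and 31 days.
monthly = pd.Series(
    [31.0, 56.0, 93.0],
    index=pd.date_range("2014-01-01", periods=3, freq="MS"),
)

# Daily grid that also covers the whole of the last month.
last_day = monthly.index[-1] + pd.offsets.MonthEnd(0)
grid = pd.date_range(monthly.index[0], last_day, freq="D")

# Repeat each total across its days and remember which month each day came from.
filled = monthly.reindex(grid).ffill()
source = pd.Series(monthly.index, index=monthly.index).reindex(grid).ffill()

# Divide by the number of days that share the same source month.
daily = filled / filled.groupby(source).transform("count")
print(daily.head())
```

Here the divisor is 31 for January and March and 28 for February, so each month's total is preserved when the daily values are summed back up.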

Answer 3

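I hate this solution, but it can be used for upsampling when you are not sure how many new intervals there will be. Going from weeks to days is easy, since a week always has 7 days. But I find that the number of intervals produced by upsampling is often not known in advance, and this solution handles that case.

The idea is to put the number of post-resampling intervals into the original, pre-resampling DataFrame, then resample and divide the data by the interval count. Side note: this is a DataFrame, not a Series.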

# Create unique group IDs by simply using the existing index (Assumes an integer, non-duplicated index)
df['group'] = df.index  

# Get the count of intervals for each post-resampled timestamp.
df['count'] = df.set_index('timestamp').resample('15min').ffill()['group'].value_counts()

# Resample all data again and fill so that the count is now included in every row.
df          = df.set_index('timestamp').resample('15min').ffill()

# Apply the division on the entire dataframe and clean up.
df          = df.div(df['count'], axis = 0).reset_index().drop(['group','count'], axis = 1)

I wrapped this in a function and tucked it away so that I never have to look at anything like it again:

def distribute_upsample(df, index, freq)

where index is 'timestamp' and freq is '15min'.
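The body of that function is not shown in the answer. The following is a hypothetical reconstruction based only on the steps described above, with the signature taken from the answer; it assumes `df` has a datetime column named by `index` and that every other column is numeric. The helper column `group` and the sample data are illustrative.

```python
import pandas as pd

def distribute_upsample(df, index, freq):
    """Upsample df to freq and split each row's values evenly over the new rows."""
    out = df.copy()
    out["group"] = range(len(out))  # one id per original row
    # Forward-fill so that every new row carries the values and id of its source row.
    filled = out.set_index(index).resample(freq).ffill()
    # Number of upsampled rows that came from each original row.
    counts = filled.groupby("group")["group"].transform("count")
    values = filled.drop(columns="group").div(counts, axis=0)
    return values.reset_index()

# Three hourly rows upsampled to 15-minute rows.
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=3, freq="60min"),
    "value": [4.0, 8.0, 12.0],
})
result = distribute_upsample(df, "timestamp", "15min")
print(result)
```

Note that forward-filling stops at the last original timestamp, so in this sketch the final row is not expanded and keeps its full value; the first two hourly rows are each split into four equal parts.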

Dividing the sum evenly among higher sampled time periods when upsampling with pandas
