Given this time series:
>>> from pandas import date_range
>>> from pandas import Series
>>> dates = date_range('2019-01-01', '2019-01-10', freq='D')[[0, 4, 5, 8]]
>>> dates
DatetimeIndex(['2019-01-01', '2019-01-05', '2019-01-06', '2019-01-09'], dtype='datetime64[ns]', freq=None)
>>> series = Series(index=dates, data=[0., 1., 2., 3.])
>>> series
2019-01-01 0.0
2019-01-05 1.0
2019-01-06 2.0
2019-01-09 3.0
dtype: float64
I can resample it to '2D' with pandas and get:
>>> series.resample('2D').sum()
2019-01-01 0.0
2019-01-03 0.0
2019-01-05 3.0
2019-01-07 0.0
2019-01-09 3.0
Freq: 2D, dtype: float64
However, I would like to get:
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
dtype: float64
Or at least (so that I can drop the NaNs):
2019-01-01 0.0
2019-01-03 NaN
2019-01-05 3.0
2019-01-07 NaN
2019-01-09 3.0
Freq: 2D, dtype: float64
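I would like to keep using the '2D' syntax (or 'W', '3H', or whatever else) and let pandas take care of the grouping/resampling. The workaround below looks dirty and inefficient; hopefully someone can suggest a better option. :-D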
>>> resampled = series.resample('2D')
>>> (resampled.mean() * resampled.count()).dropna()
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
dtype: float64
Answer 0: (score: 1)
It is clearer to take the sum and then use resampled.count() as a condition, like this:
resampled = series.resample('2D')
resampled.sum()[resampled.count() != 0]
Out:
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
dtype: float64
On my machine, this method is 22% faster (5.52 ms vs. 7.15 ms).
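Since the question asks to keep the frequency-string syntax, the count-based mask can be wrapped in a small helper. This is only a sketch: the name sum_nonempty is hypothetical (not part of pandas), and it assumes the series has a DatetimeIndex.

```python
import pandas as pd


def sum_nonempty(series: pd.Series, rule: str) -> pd.Series:
    """Sum per resampling bin, keeping only bins that hold at least one value.

    Hypothetical helper for illustration; `rule` is any pandas frequency
    string accepted by Series.resample, such as '2D' or 'W'.
    """
    resampled = series.resample(rule)
    # count() is 0 for empty bins, so the mask removes exactly those bins.
    return resampled.sum()[resampled.count() != 0]


dates = pd.to_datetime(['2019-01-01', '2019-01-05', '2019-01-06', '2019-01-09'])
series = pd.Series([0.0, 1.0, 2.0, 3.0], index=dates)

two_day = sum_nonempty(series, '2D')
weekly = sum_nonempty(series, 'W')
print(two_day)
print(weekly)
```

The mask relies on count() and sum() sharing the same bin index, which holds here because both come from the same resampler.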
Answer 1: (score: 1)
You can use the keyword argument min_count:
>>> series.resample('2D').sum(min_count=1).dropna()
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
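To see why this works, here is a minimal sketch comparing the default behaviour with min_count=1. With the default min_count=0, an empty bin sums to 0.0; with min_count=1, a bin needs at least one valid value, otherwise its sum is NaN and dropna() can remove it. The variable names are chosen only for illustration.

```python
import pandas as pd

dates = pd.to_datetime(['2019-01-01', '2019-01-05', '2019-01-06', '2019-01-09'])
series = pd.Series([0.0, 1.0, 2.0, 3.0], index=dates)

resampled = series.resample('2D')

# Default (min_count=0): empty bins sum to 0.0, so they cannot be told
# apart from a bin whose real values add up to zero (like 2019-01-01).
with_zeros = resampled.sum()

# min_count=1: bins with fewer than one valid value yield NaN instead.
with_nans = resampled.sum(min_count=1)

# Dropping the NaNs keeps only the bins that actually contained data.
result = with_nans.dropna()

print(with_zeros)
print(with_nans)
print(result)
```

Note that the genuine zero for 2019-01-01 survives, which a filter such as "drop all zeros" would not guarantee.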
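Performance comparison with the other methods, from fastest to slowest (run your own tests, since this may depend on your architecture, platform, environment...):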
In [38]: %timeit resampled.sum(min_count=1).dropna()
588 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit (resampled.mean() * resampled.count()).dropna()
622 µs ± 3.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [40]: %timeit resampled.sum()[resampled.count() != 0].copy()
960 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
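For anyone who wants to rerun the comparison outside IPython, the following is a rough sketch using the standard timeit module. The labels and repetition counts are arbitrary choices, and the absolute numbers will differ from the ones above; it also checks that all three approaches return the same values.

```python
import timeit

import pandas as pd

dates = pd.date_range('2019-01-01', '2019-01-10', freq='D')[[0, 4, 5, 8]]
series = pd.Series([0.0, 1.0, 2.0, 3.0], index=dates)
resampled = series.resample('2D')

# The three approaches from this thread, keyed by a descriptive label.
candidates = {
    'sum(min_count=1).dropna()': lambda: resampled.sum(min_count=1).dropna(),
    '(mean * count).dropna()': lambda: (resampled.mean() * resampled.count()).dropna(),
    'sum()[count != 0]': lambda: resampled.sum()[resampled.count() != 0],
}

reference = candidates['sum(min_count=1).dropna()']()
timings = {}
for label, func in candidates.items():
    # Sanity check: every approach must produce the same values.
    assert func().tolist() == reference.tolist()
    # Best of 3 runs of 100 calls each, converted to seconds per call.
    timings[label] = min(timeit.repeat(func, number=100, repeat=3)) / 100

# Report from fastest to slowest on this machine.
for label, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f'{label}: {seconds * 1e6:.0f} us per call')
```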