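I can't work out how to resample a pandas DataFrame with a DatetimeIndex while requiring a minimum number of values before a result is produced. I'd like to resample daily data to monthly, and require that at least 90% of the values be present for a value to be produced.
Input daily data: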
import numpy as np
import pandas as pd
rng = pd.date_range('1/1/2011', periods=365, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts['2011-01-01':'2011-01-05'] = np.nan  # add a short run of NaNs to the time series
ts['2011-10-03':'2011-10-30'] = np.nan  # add a roughly month-long run of NaNs to the time series
January has only a few NaNs, but October is NaN for almost the entire month. I'd like the output of my monthly resampled sum:
ts.resample('M').sum()
to give NaN for October (> 90% of the daily data missing) and a value for January (< 90% of the data missing), instead of the current output:
2011-01-31 11.949479
2011-02-28 -1.730698
2011-03-31 -0.141164
2011-04-30 -0.291702
2011-05-31 -1.996223
2011-06-30 -1.936878
2011-07-31 5.025407
2011-08-31 -1.344950
2011-09-30 -2.035502
2011-10-31 -2.571338
2011-11-30 -13.492956
2011-12-31 7.100770
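I read this post, which uses a rolling mean with min_periods; I'd prefer to keep using resample for its direct time indexing. Is this possible? I haven't been able to find much in the resample docs or on Stack Overflow that addresses this.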
Answer 0 (score: 5)
Get both the sum and the non-null count when using resample, then use the non-null count to alter the sum as needed:
import numpy as np
# resample, getting the sum and the non-null count for each month
ts = ts.resample('M').agg(['sum', 'count'])
# determine invalid months: 10% or fewer of the days have a value
invalid = ts['count'] <= 0.1 * ts.index.days_in_month
# restrict to the sum and null out invalid entries
ts = ts['sum']
ts[invalid] = np.nan
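Alternatively, you could write a custom sum function that does this filtering internally, though it may not be very efficient on large datasets: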
def sum_valid_obs(x):
    # minimum number of valid observations: 10% of the days in this month
    min_obs = 0.1 * x.index[0].days_in_month
    valid_obs = x.notnull().sum()
    if valid_obs < min_obs:
        return np.nan
    return x.sum()
ts = ts.resample('M').apply(sum_valid_obs)
Resulting output from either method:
2011-01-31 3.574859
2011-02-28 2.907705
2011-03-31 -10.060877
2011-04-30 3.270250
2011-05-31 -3.492617
2011-06-30 -1.855461
2011-07-31 -7.363193
2011-08-31 0.128842
2011-09-30 -9.509890
2011-10-31 NaN
2011-11-30 0.543561
2011-12-31 3.354250
Freq: M, Name: sum, dtype: float64
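The sum-and-count idea above can also be wrapped in a small reusable helper. The following is only a sketch under a few assumptions: the name resample_sum_min_fraction and its min_fraction parameter are hypothetical, the expected number of observations per period is taken from the number of index rows in that period (so the index is assumed to contain a row for every expected timestamp, with NaN marking missing data), and the 'MS' (month start) alias is used only because the month-end alias changed from 'M' to 'ME' in pandas 2.2. A series of ones replaces the random data so that the sums are easy to check by eye.

```python
import numpy as np
import pandas as pd


def resample_sum_min_fraction(series, freq="MS", min_fraction=0.1):
    """Sum `series` per period; return NaN where too few values are present.

    Hypothetical helper: a period keeps its sum only if at least
    `min_fraction` of its rows hold a non-null value.
    """
    grouped = series.resample(freq)
    sums = grouped.sum()        # NaNs are skipped by default
    counts = grouped.count()    # non-null observations per period
    expected = grouped.size()   # all rows per period, NaNs included
    enough = (counts >= min_fraction * expected) & (expected > 0)
    return sums.where(enough)   # periods failing the check become NaN


# Deterministic stand-in for the question's data: ones instead of random values
rng = pd.date_range("2011-01-01", periods=365, freq="D")
ones = pd.Series(1.0, index=rng)
ones.loc["2011-01-01":"2011-01-05"] = np.nan  # short gap in January
ones.loc["2011-10-03":"2011-10-30"] = np.nan  # nearly all of October missing

monthly = resample_sum_min_fraction(ones, freq="MS", min_fraction=0.1)
print(monthly)
```

With min_fraction=0.1, matching the threshold used in this answer, October (3 valid days out of 31) is nulled out while January (26 valid days) keeps its sum. Raising min_fraction to 0.9 would also null out January, since only 26 of its 31 days are present.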
Answer 1 (score: 1)
With recent pandas versions (from v0.22.0 on, going by the docs), you can simply use the min_count keyword argument:
import numpy as np
import pandas as pd
rng = pd.date_range('1/1/2011', periods=365, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts['2011-01-01':'2011-01-05'] = np.nan  # add a short run of NaNs to the time series
ts['2011-10-03':'2011-10-30'] = np.nan  # add a roughly month-long run of NaNs to the time series
ts.resample('M').sum(min_count=20)
Output:
2011-01-31 8.000269
2011-02-28 -6.648587
2011-03-31 10.593682
2011-04-30 -1.214945
2011-05-31 4.259289
2011-06-30 -5.986097
2011-07-31 -6.612820
2011-08-31 -1.073952
2011-09-30 -2.164976
2011-10-31 NaN
2011-11-30 1.912070
2011-12-31 12.101526
Freq: M, dtype: float64
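To make the effect of min_count easier to verify, here is a small sketch that repeats the same gaps on a series of ones, so each monthly sum equals the number of valid days. It is an illustration rather than part of the original answer; the 'MS' (month start) alias is an assumption chosen because the month-end alias changed from 'M' to 'ME' in pandas 2.2.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2011-01-01", periods=365, freq="D")
ones = pd.Series(1.0, index=rng)
ones.loc["2011-01-01":"2011-01-05"] = np.nan  # short gap in January
ones.loc["2011-10-03":"2011-10-30"] = np.nan  # nearly all of October missing

# Default behaviour: NaNs are skipped, so October still yields a sum
monthly_default = ones.resample("MS").sum()

# With min_count, a month needs at least 20 non-null values to yield a sum
monthly_min20 = ones.resample("MS").sum(min_count=20)

print(monthly_default[pd.Timestamp("2011-10-01")])  # sum of the 3 valid days
print(monthly_min20[pd.Timestamp("2011-10-01")])    # NaN: only 3 valid days
```

Note that min_count is an absolute number of valid observations, not a fraction of the month, so the same threshold applies to a 28-day February and a 31-day October. A percentage-based rule still needs the per-month count, as in the other answer.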