Pandas - resampling and standard deviation

Date: 2014-01-31 12:24:56

Tags: python pandas time-series resampling

I have this DataFrame:

                      startTime     endTime  emails_received
index                                             
2014-01-24 14:00:00  1390568400  1390569600    684
2014-01-24 14:00:00  1390568400  1390569300    700
2014-01-24 14:05:00  1390568700  1390569300    438
2014-01-24 14:05:00  1390568700  1390569900    586
2014-01-24 16:00:00  1390575600  1390576500    752
2014-01-24 16:00:00  1390575600  1390576500    743
2014-01-24 16:00:00  1390575600  1390576500    672
2014-01-24 16:00:00  1390575600  1390576200    712
2014-01-24 16:00:00  1390575600  1390576800    708

I run resample("10min", how="median").dropna() and get:

                      startTime     endTime  emails_received
start                                             
2014-01-24 14:00:00  1390568550  1390569450    635
2014-01-24 16:00:00  1390575600  1390576500    712

This is correct. Is there an easy way to get the standard deviation from the mean with pandas?

1 answer:

Answer 0 (score: 7)

You just need to call .std() on the DataFrame. Here is an illustrative example.

Create a DatetimeIndex

In [38]: index = pd.DatetimeIndex(start='2000-1-1',freq='1T', periods=1000)

Create a DataFrame with 2 columns

In [45]: df = pd.DataFrame({'a':range(1000), 'b':range(1000,3000,2)}, index=index)
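The session above dates from 2014. In recent pandas releases the DatetimeIndex constructor no longer accepts start/freq/periods, so the following is a minimal sketch of the same setup using pd.date_range. The "min" frequency alias is used in place of "1T" here as an assumption about what the reader's pandas version prefers; both denote one-minute spacing.

```python
import pandas as pd

# 1000 timestamps at one-minute spacing, starting 2000-01-01 00:00
index = pd.date_range(start="2000-01-01", periods=1000, freq="min")

# Same two columns as in the session above
df = pd.DataFrame({"a": range(1000), "b": range(1000, 3000, 2)}, index=index)

print(df.head())
```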

Head, standard deviation, and mean of the DataFrame

In [47]: df.head()
Out[47]: 
                     a     b
2000-01-01 00:00:00  0  1000
2000-01-01 00:01:00  1  1002
2000-01-01 00:02:00  2  1004
2000-01-01 00:03:00  3  1006
2000-01-01 00:04:00  4  1008

In [48]: df.std()
Out[48]: 
a    288.819436
b    577.638872
dtype: float64

In [49]: df.mean()
Out[49]: 
a     499.5
b    1999.0
dtype: float64

Downsample and compute the same statistics

In [54]: df = df.resample(rule="10T",how="median")

In [55]: df
Out[55]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100 entries, 2000-01-01 00:00:00 to 2000-01-01 16:30:00
Freq: 10T
Data columns (total 2 columns):
a    100  non-null values
b    100  non-null values
dtypes: float64(1), int64(1)

In [56]: df.head()
Out[56]: 
                        a     b
2000-01-01 00:00:00   4.5  1009
2000-01-01 00:10:00  14.5  1029
2000-01-01 00:20:00  24.5  1049
2000-01-01 00:30:00  34.5  1069
2000-01-01 00:40:00  44.5  1089

In [57]: df.std()
Out[57]: 
a    290.11492
b    580.22984
dtype: float64

In [58]: df.mean()
Out[58]: 
a     499.5
b    1999.0
dtype: float64
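The how= argument to resample used above was later removed from pandas; the aggregation is now chained as a method. A hedged sketch of the same downsampling step follows. The name medians is introduced here for illustration so that df keeps its one-minute resolution.

```python
import pandas as pd

index = pd.date_range(start="2000-01-01", periods=1000, freq="min")
df = pd.DataFrame({"a": range(1000), "b": range(1000, 3000, 2)}, index=index)

# Median of each 10-minute bin, stored under a new name
medians = df.resample("10min").median()

print(medians.head())
print(medians.std())   # spread of the 100 per-bin medians
print(medians.mean())  # mean of the 100 per-bin medians
```

Note that .std() here measures how much the per-bin medians vary from one bin to the next, not the spread of the raw values inside each bin.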

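Downsample with std() as the aggregation (this must run on the original 1-minute df; In [54] rebound df to the medians, so recreate it first)
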
In [62]: df2 = df.resample(rule="10T", how=np.std)

In [63]: df2
Out[63]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100 entries, 2000-01-01 00:00:00 to 2000-01-01 16:30:00
Freq: 10T
Data columns (total 2 columns):
a    100  non-null values
b    100  non-null values
dtypes: float64(2)

In [64]: df2.head()
Out[64]: 
                           a         b
2000-01-01 00:00:00  3.02765  6.055301
2000-01-01 00:10:00  3.02765  6.055301
2000-01-01 00:20:00  3.02765  6.055301
2000-01-01 00:30:00  3.02765  6.055301
2000-01-01 00:40:00  3.02765  6.055301
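Applied to the data in the question, the per-bin median and standard deviation can be computed together. This is a sketch using the current resample(...).agg(...) form rather than the how= argument shown above; the names times, emails, and stats are illustrative. The standard deviation is taken about each bin's mean, which is what the question asks for.

```python
import pandas as pd

# The nine rows from the question, indexed by their timestamps
times = pd.to_datetime(
    ["2014-01-24 14:00:00"] * 2
    + ["2014-01-24 14:05:00"] * 2
    + ["2014-01-24 16:00:00"] * 5
)
emails = pd.DataFrame(
    {"emails_received": [684, 700, 438, 586, 752, 743, 672, 712, 708]},
    index=times,
)

# Median and sample standard deviation of each 10-minute bin in one pass;
# dropna() discards the empty bins between 14:10 and 16:00
stats = emails["emails_received"].resample("10min").agg(["median", "std"]).dropna()
print(stats)
```

The median column reproduces the 635 and 712 shown in the question, and the std column gives the spread of the raw values within the same two bins.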

Here is the information from the docstring of the .std() method.

Return standard deviation over requested axis.
NA/null values are excluded

Parameters
----------
axis : {0, 1}
    0 for row-wise, 1 for column-wise
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA
level : int, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a DataFrame

Returns
-------
std : Series (or DataFrame if level specified)

        Normalized by N-1 (unbiased estimator).
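The last line of the docstring matters when comparing against other tools. As a small illustration, the sketch below takes the ten values that fall in the first 10-minute bin of column a and computes the standard deviation with both normalizations. It relies on the ddof parameter, which current pandas and NumPy both accept.

```python
import numpy as np
import pandas as pd

# One 10-minute bin of column "a" from the example: the values 0..9
values = pd.Series(range(10))

sample_std = values.std()              # pandas default ddof=1, divides by N-1
population_std = values.std(ddof=0)    # divides by N
numpy_std = np.std(values.to_numpy())  # NumPy default ddof=0, divides by N

print(sample_std, population_std, numpy_std)
```

The ddof=1 result matches the 3.02765 reported per bin above; calling NumPy directly on the raw array gives the smaller N-normalized value unless ddof=1 is passed.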