Python: Pandas: Rolling windows - mean() works, but variance() doesn't?

Time: 2016-06-16 00:36:53

Tags: python pandas dataframe time-series nan

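I have the following data, recorded at a resolution of seconds: http://pastebin.com/wBSJWYn2

I want to capture various summary statistics (mean, variance, and so on) at 1-minute intervals, so I run these functions on sensor_data.rolling(window=1,freq="1MIN"). For the most part this works, but for some functions I run into one of two kinds of irregularity:

  1. No output for incomplete minutes: minutes that do not contain a full 60 seconds of data get no output. This is the case for mean(), quantile(), sum().
  2. No output at all: for some functions, e.g. var(), std(), kurt(), skew(), I get no values whatsoever. I can't understand why, given that the mean can be computed...

Other functions seem to have no problem: max(), median(), min()

I mostly care about the second issue, but a workaround for the first would be a bonus...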

    sensor_data.head()
    
                         x_acceleration  y_acceleration  z_acceleration  heart_rate  electrodermal_activity  temperature
    index
    2016-05-16 06:58:44       -33.25000       -43.03125        33.09375         NaN                0.297099        33.33
    2016-05-16 06:58:45       -28.15625       -52.90625        24.12500         NaN                0.219612        33.33
    2016-05-16 06:58:46       -25.87500       -55.96875        21.18750         NaN                0.222648        33.33
    2016-05-16 06:58:47       -24.00000       -57.46875        19.40625         NaN                0.217335        33.33
    2016-05-16 06:58:48       -22.84375       -56.25000        23.40625         NaN                0.214300        33.33
    

Example output for the first case (no output for the incomplete minute):

    sensor_data.rolling(window=1,freq="1MIN").mean().head()
                         x_acceleration  y_acceleration  z_acceleration  heart_rate  electrodermal_activity  temperature
    index
    2016-05-16 06:58:00             NaN             NaN             NaN         NaN                     NaN          NaN
    2016-05-16 06:59:00       -24.84375       -59.46875         9.03125       68.57                0.208988        33.75
    2016-05-16 07:00:00         6.31250       -62.78125         6.46875       79.40                0.224924        33.84
    2016-05-16 07:01:00       -21.18750       -57.00000        22.50000       92.00                0.224165        34.13
    2016-05-16 07:02:00       -17.46875       -58.87500        21.84375       81.10                0.224165        34.25
    

Example output for the second case (no output at all):

    sensor_data.rolling(window=1,freq="1MIN").var().head()
    
                         x_acceleration  y_acceleration  z_acceleration  heart_rate  electrodermal_activity  temperature
    index
    2016-05-16 06:58:00             NaN             NaN             NaN         NaN                     NaN          NaN
    2016-05-16 06:59:00             NaN             NaN             NaN         NaN                     NaN          NaN
    2016-05-16 07:00:00             NaN             NaN             NaN         NaN                     NaN          NaN
    2016-05-16 07:01:00             NaN             NaN             NaN         NaN                     NaN          NaN
    2016-05-16 07:02:00             NaN             NaN             NaN         NaN                     NaN          NaN
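One possible reading of the all-NaN result, which the post itself does not confirm: with window=1 every window holds a single observation, and the sample variance of one value (ddof=1, the pandas default) is undefined, whereas the mean, max and min of one value are simply that value. The sketch below reproduces this on a small made-up series; `readings` and its values are illustrative assumptions, and the freq= argument (deprecated in later pandas releases) is deliberately left out.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one sensor column, one reading per minute.
index = pd.date_range("2016-05-16 06:59:00", periods=5, freq="1min")
readings = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0], index=index)

one_point = readings.rolling(window=1)

window_mean = one_point.mean()  # a single value is its own mean
window_max = one_point.max()    # ...and its own maximum
window_var = one_point.var()    # sample variance (ddof=1) of one value is undefined
window_std = one_point.std()    # same for the standard deviation

print(pd.DataFrame({"mean": window_mean, "max": window_max,
                    "var": window_var, "std": window_std}))
```

If this is what is happening, variance-like statistics only become defined once each window (or group) contains more than one observation.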
    

1 Answer:

Answer 0 (score: 1)

For starters, this will get you going.

    sensor_data.groupby(pd.Grouper(level=0, freq='Min')).describe()
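A minimal sketch of this per-minute grouping on synthetic data, since the pastebin data is not reproduced here. The frame below is a made-up stand-in for sensor_data with only two of its columns; the random values, the seed and the number of rows are assumptions chosen for illustration. In this example the partial minutes at either end still receive a row, which bears on the first issue in the question.

```python
import numpy as np
import pandas as pd

# Synthetic one-second readings starting mid-minute, like the question's data.
rng = np.random.default_rng(0)
index = pd.date_range("2016-05-16 06:58:44", periods=200, freq="1s")
sensor_data = pd.DataFrame(
    {
        "x_acceleration": rng.normal(size=200),
        "temperature": rng.normal(33.5, 0.2, size=200),
    },
    index=index,
)
sensor_data.index.name = "index"

# Group rows by the minute their timestamp falls in, then summarize each group.
per_minute = sensor_data.groupby(pd.Grouper(level=0, freq="1min"))
summary = per_minute.describe()

# Rows per minute: 16, 60, 60, 60, 4 (the first and last minutes are partial).
print(summary["temperature"])
```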

You can build a custom function:

    import pandas as pd

    def stats(df):
        # Each statistic is a Series indexed by column; turn it into a one-row frame
        kurt = df.kurt().to_frame('kurt').T
        skew = df.skew().to_frame('skew').T
        var = df.var().to_frame('var').T
        # Append the extra rows below the standard describe() summary
        return pd.concat([df.describe(), var, skew, kurt])

Then:

    sensor_data.groupby(pd.Grouper(level=0, freq='Min')).apply(stats)
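To see what the apply-based approach returns, here is a self-contained sketch on synthetic data. The frame is a hypothetical stand-in for sensor_data, and this `stats` is a compact variant of the function above (it builds the three extra rows in one step), not the answer's exact code.

```python
import numpy as np
import pandas as pd

def stats(df):
    # describe() rows plus variance, skewness and kurtosis, one row per statistic
    extra = pd.DataFrame({"var": df.var(), "skew": df.skew(), "kurt": df.kurt()}).T
    return pd.concat([df.describe(), extra])

# Synthetic one-second readings (values and seed are made up).
rng = np.random.default_rng(0)
index = pd.date_range("2016-05-16 06:58:44", periods=200, freq="1s")
sensor_data = pd.DataFrame(
    {
        "x_acceleration": rng.normal(size=200),
        "temperature": rng.normal(33.5, 0.2, size=200),
    },
    index=index,
)

# The result is indexed by (minute, statistic name).
per_minute_stats = sensor_data.groupby(pd.Grouper(level=0, freq="1min")).apply(stats)

first_full_minute = pd.Timestamp("2016-05-16 06:59:00")
print(per_minute_stats.loc[first_full_minute])
```

Each group here holds many observations, so var, skew and kurt are defined, unlike in a single-observation window.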


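Edit:

Following @Jeff's comment:

    import numpy as np

    # Output label -> aggregation (a method name or a callable applied per group)
    funcs = {
        'Count': 'count',
        'Var': np.var,
        'Std': np.std,
        'Mean': np.mean,
        'Min': np.min,
        '25%': lambda x: x.quantile(.25),
        '50%': np.median,
        '75%': lambda x: x.quantile(.75),
        'Max': np.max,
        'Skew': 'skew',
        'Kurt': lambda x: x.kurt(),
    }

    cols = sensor_data.columns

This is a comprehensive list of functions.

    # Apply every function to every column, then move the statistic labels into the index.
    # Note: a nested dict like this was accepted by pandas at the time of writing;
    # pandas 1.0 and later reject it with a SpecificationError.
    sensor_data.groupby(pd.Grouper(level=0, freq='Min')).agg({c: funcs for c in cols}).stack()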
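On recent pandas releases, where nested dictionaries are no longer accepted by agg, a similar table can be sketched by passing a list of functions instead. This is an adaptation, not the answer's original code: the helper names `q25`, `q75` and `kurt` are hypothetical, the output labels become the function names rather than 'Count', 'Var' and so on, and the data is synthetic.

```python
import numpy as np
import pandas as pd

# Named helpers, so each output column gets a readable label.
def q25(x):
    return x.quantile(0.25)

def q75(x):
    return x.quantile(0.75)

def kurt(x):
    return x.kurt()

funcs = ["count", "var", "std", "mean", "min", q25, "median", q75, "max", "skew", kurt]

# Synthetic one-second readings (values and seed are made up).
rng = np.random.default_rng(0)
index = pd.date_range("2016-05-16 06:58:44", periods=200, freq="1s")
sensor_data = pd.DataFrame(
    {
        "x_acceleration": rng.normal(size=200),
        "temperature": rng.normal(33.5, 0.2, size=200),
    },
    index=index,
)

# One row per minute; columns are (sensor column, statistic) pairs.
summary = sensor_data.groupby(pd.Grouper(level=0, freq="1min")).agg(funcs)

print(summary["temperature"])
```

Selecting `summary["temperature"]` gives one column per statistic for that sensor channel.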


Timing

    %%timeit
    sensor_data.groupby(pd.Grouper(level=0, freq='Min')).agg({c: funcs for c in cols}).stack()

    10 loops, best of 3: 121 ms per loop

    %%timeit
    sensor_data.groupby(pd.Grouper(level=0, freq='Min')).apply(stats).dropna()

    1 loop, best of 3: 221 ms per loop

It looks like agg is roughly twice as fast (121 ms versus 221 ms).