Aggregating a DataFrame over a selected window

Time: 2018-07-18 20:30:00

Tags: python pandas dataframe

I am using Python with pandas DataFrames. I have a DataFrame imported from a CSV file.

         volume  temperature(c)
time(sec)
1000.1  10.4   26.5
1000.2  12.5   30.2
1000.3  13.2   40.5
.
.
.
8000.1  78   50.8
8000.2  79   51.5

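I want to create a new DataFrame. We define a time window W (for example, 5 seconds), and every W seconds the values of each column are aggregated into a single row, using different computations over that window, for example mean, std, z-score, etc. An example of the output DataFrame: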

time(sec)  mean_volume  mean_temperature  std_volume
1000.1     12.0         32.4              1.4
1005.1     12.5         30.2              1.7
1010.1     11.7         30.1              1.5
.
.
.

I am familiar with df['new col'] = data['source'].rolling(W).mean(), but that is not the solution I am looking for. I am attaching a sample:

T,H,L,C,label
1000.1,23.18,27.272,426,1
1000.2,23.15,27.2675,429.5,1
1000.3,23.15,27.245,426,1
1000.4,23.15,27.2,426,1
1000.5,23.1,27.2,426,1
1000.6,23.1,27.2,419,1
1000.7,23.1,27.2,419,1
1000.8,23.1,27.2,419,1
1000.9,23.1,27.2,419,1
1001,23.075,27.175,419,1
1001.1,23.075,27.15,419,1
1001.2,23.1,27.1,419,1
1001.3,23.1,27.16666667,419,1
1001.4,23.05,27.15,419,1
1001.5,23,27.125,419,1
1001.6,23,27.125,418.5,1
1001.7,23,27.2,0,0
1001.8,22.945,27.29,0,0
1001.9,22.945,27.39,0,0
1002,22.89,27.39,0,0
1002.1,22.89,27.39,0,0
1002.2,22.89,27.39,0,0
1002.3,22.89,27.445,0,0

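For the sample above, I would like the new DataFrame to contain the following columns: H_mean, H_std, L_mean, C_mean, L_std, C_std

Also, how can I apply a custom function (for example, a z-score) to each segment?

Thanks

1 Answer:

Answer 0: (score: 2)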

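Given that your data is in a pd.DataFrame named df, the following approach should do the trick:

    import pandas as pd
    import numpy as np

    step = 5  # window length W, in the same units as the index (seconds)

    # Bin the index into consecutive intervals of width `step`, then aggregate per bin.
    # The key 'temperature(c)' matches the column name shown in the question.
    bins = np.arange(start=df.index.min(), stop=df.index.max(), step=step, dtype=float)
    df.groupby(pd.cut(df.index, bins))\
      .agg({'volume': ['mean', 'std'], 'temperature(c)': ['mean']})

We are using pd.cut to bin the index into intervals (the bins form an IntervalIndex), which we then group by. Finally, we use pd.DataFrame.agg to compute summary statistics for each group: mean and std for the volume column, and only mean for the temperature(c) column.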

I haven't tested this, but I can do so if you provide a minimal, complete and verifiable example.
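One detail worth checking with the pd.cut approach is which rows end up outside every bin. By default pd.cut builds right-closed intervals (a, b], so the row sitting exactly at df.index.min() is not in the first bin, and using df.index.max() as the stop value leaves out anything past the last full edge. The sketch below uses made-up data (the frame, `edges` and `n_windows` are illustrative names, not from the question) to show one possible way to cover every row with left-closed windows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one sample per second for 12 seconds (not the question's file).
df = pd.DataFrame(
    {"volume": np.arange(12, dtype=float)},
    index=pd.Index(np.arange(100.0, 112.0), name="time(sec)"),
)

step = 5  # window length W in seconds

# Bins as in the answer: right-closed, stopping at the maximum index value.
default_windows = pd.cut(
    df.index, np.arange(start=df.index.min(), stop=df.index.max(), step=step, dtype=float)
)
dropped = pd.isna(default_windows).sum()  # rows that fall outside every bin

# Alternative: enough left-closed bins [a, b) to cover the first and last rows.
t_min, t_max = df.index.min(), df.index.max()
n_windows = int(np.floor((t_max - t_min) / step)) + 1
edges = t_min + step * np.arange(n_windows + 1)
windows = pd.cut(df.index, edges, right=False)

result = df.groupby(windows, observed=True)["volume"].agg(["mean", "std", "count"])
print("rows outside the default bins:", dropped)
print(result)
```

With fractional timestamps such as 1000.1, floating-point rounding can push a sample that sits exactly on an edge into the neighbouring window, so it may be worth rounding the index or the edges first.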

Edit

Given the updated data, I wrote the following code:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from io import StringIO

In [4]: s = """T,H,L,C,label
    ...: 1000.1,23.18,27.272,426,1
    ...: 1000.2,23.15,27.2675,429.5,1
    ...: 1000.3,23.15,27.245,426,1
    ...: 1000.4,23.15,27.2,426,1
    ...: 1000.5,23.1,27.2,426,1
    ...: 1000.6,23.1,27.2,419,1
    ...: 1000.7,23.1,27.2,419,1
    ...: 1000.8,23.1,27.2,419,1
    ...: 1000.9,23.1,27.2,419,1
    ...: 1001,23.075,27.175,419,1
    ...: 1001.1,23.075,27.15,419,1
    ...: 1001.2,23.1,27.1,419,1
    ...: 1001.3,23.1,27.16666667,419,1
    ...: 1001.4,23.05,27.15,419,1
    ...: 1001.5,23,27.125,419,1
    ...: 1001.6,23,27.125,418.5,1
    ...: 1001.7,23,27.2,0,0
    ...: 1001.8,22.945,27.29,0,0
    ...: 1001.9,22.945,27.39,0,0
    ...: 1002,22.89,27.39,0,0
    ...: 1002.1,22.89,27.39,0,0
    ...: 1002.2,22.89,27.39,0,0
    ...: 1002.3,22.89,27.445,0,0"""

In [5]: df = pd.read_csv(StringIO(s), index_col='T')

Again, we use groupby with an IntervalIndex, together with agg, to compute the summary statistics.

In [6]: step = 0.5
    ...: 
    ...: grouped = df.groupby(pd.cut(df.index,
    ...:                  np.arange(start=df.index.min(), stop=df.index.max(), step=step, dtype=float)))
    ...: 

In [7]: grouped.agg({'H':['mean', 'std'], 'L':['mean', 'std'], 'C':['mean', 'std']})
Out[7]: 
                       H                    L                C          
                    mean       std       mean       std   mean       std
(1000.1, 1000.6]  23.130  0.027386  27.222500  0.031820  425.3  3.834058
(1000.6, 1001.1]  23.090  0.013693  27.185000  0.022361  419.0  0.000000
(1001.1, 1001.6]  23.050  0.050000  27.133333  0.025685  418.9  0.223607
(1001.6, 1002.1]  22.934  0.046016  27.332000  0.085557    0.0  0.000000

This does not give you the column names you asked for, so we flatten the column MultiIndex to fix that.

In [8]: aggregated = grouped.agg({'H':['mean', 'std'], 'L':['mean', 'std'], 'C':['mean', 'std']})

In [9]: ['_'.join(col).strip() for col in aggregated.columns.values]
Out[9]: ['H_mean', 'H_std', 'L_mean', 'L_std', 'C_mean', 'C_std']

In [10]: aggregated.columns = ['_'.join(col).strip() for col in aggregated.columns.values]

In [11]: aggregated
Out[11]: 
                  H_mean     H_std     L_mean     L_std  C_mean     C_std
(1000.1, 1000.6]  23.130  0.027386  27.222500  0.031820   425.3  3.834058
(1000.6, 1001.1]  23.090  0.013693  27.185000  0.022361   419.0  0.000000
(1001.1, 1001.6]  23.050  0.050000  27.133333  0.025685   418.9  0.223607
(1001.6, 1002.1]  22.934  0.046016  27.332000  0.085557     0.0  0.000000

I am not sure what you mean by using the Z-score, because unlike mean and std it is not a summary statistic, so it does not work well with agg. If you just want to apply the Z-score column-wise to the whole DataFrame, I suggest you take a look at this question: Pandas - Compute z-score for all columns
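If what is wanted is instead a Z-score computed inside each window, one possible route is groupby(...).transform, which returns one value per input row rather than one per group. The following is only a sketch on made-up data: the frame, `window_id` and `zscore` are illustrative names, and the windows are numbered by integer division rather than taken from pd.cut.

```python
import numpy as np
import pandas as pd

# Hypothetical data (not the question's file): 10 samples, one per second.
df = pd.DataFrame(
    {"H": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0]},
    index=pd.Index(np.arange(10.0), name="T"),
)

step = 5
# Integer window number per row: 0 for T in [0, 5), 1 for T in [5, 10).
window_id = (df.index.to_numpy() // step).astype(int)


def zscore(values: pd.Series) -> pd.Series:
    """Standardise one window: subtract its mean, divide by its sample std."""
    return (values - values.mean()) / values.std()


# transform keeps the original shape, so the result can be stored as a new column.
df["H_z"] = df.groupby(window_id)["H"].transform(zscore)

# A custom function that reduces a window to a single number fits agg instead.
h_range = df.groupby(window_id)["H"].agg(lambda s: s.max() - s.min())

print(df)
print(h_range)
```

The split mirrors the remark above: functions that reduce a window to one number go through agg, while row-wise functions such as a Z-score go through transform.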