Using GroupBy and aggregate functions in pandas

Time: 2018-07-09 04:09:48

Tags: python pandas pandas-groupby

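I have a time-series dataset like the one below. I want to split it into 20-second bins, get the minimum and maximum timestamp in each bin, and add a flag to each bin based on whether it contains at least one successful result (success: result = 0; failure: result = 1).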

data = [{"product": "abc", "test_tstamp": 1530693399, "result": 1},
    {"product": "abc", "test_tstamp": 1530693405, "result": 0},
    {"product": "abc", "test_tstamp": 1530693410, "result": 1},
    {"product": "abc", "test_tstamp": 1530693411, "result": 0},
    {"product": "abc", "test_tstamp": 1530693415, "result": 0},
    {"product": "abc", "test_tstamp": 1530693420, "result": 0},
    {"product": "abc", "test_tstamp": 1530693430, "result": 0},
    {"product": "abc", "test_tstamp": 1530693431, "result": 0}]

I can use pandas.cut() to split the data into 20-second intervals and get the minimum and maximum timestamp of each bin:

import numpy as np
import pandas as pd
arange = np.arange(1530693398, 1530693440, 20)
data = [{"product": "abc", "test_tstamp": 1530693399, "result": 1},
    {"product": "abc", "test_tstamp": 1530693405, "result": 0},
    {"product": "abc", "test_tstamp": 1530693410, "result": 1},
    {"product": "abc", "test_tstamp": 1530693411, "result": 0},
    {"product": "abc", "test_tstamp": 1530693415, "result": 0},
    {"product": "abc", "test_tstamp": 1530693420, "result": 1},
    {"product": "abc", "test_tstamp": 1530693430, "result": 1},
    {"product": "abc", "test_tstamp": 1530693431, "result": 1}]
df = pd.DataFrame(data)
df['bins'] = pd.cut(df['test_tstamp'], arange)
output_1 = df.groupby(["bins"]).agg({'result': np.ma.count, 'test_tstamp': {'mindate': np.min, 'maxdate': np.max}})

                         test_tstamp               result
                         maxdate     mindate       count
bins                                                   
(1530693398, 1530693418]  1530693415  1530693399      5
(1530693418, 1530693438]  1530693431  1530693420      3
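
A note for readers on newer pandas: the nested-dict renaming passed to agg() above was deprecated in pandas 0.20 and raises an error from pandas 1.0 onward. Below is a minimal sketch of the same aggregation using named aggregation (available from pandas 0.25), assuming the second data list above. The flat column names mindate, maxdate and count and the variable names edges and summary are my own choices, not part of the original post.

```python
import numpy as np
import pandas as pd

# Same records as the second data list in the question.
data = [{"product": "abc", "test_tstamp": 1530693399, "result": 1},
        {"product": "abc", "test_tstamp": 1530693405, "result": 0},
        {"product": "abc", "test_tstamp": 1530693410, "result": 1},
        {"product": "abc", "test_tstamp": 1530693411, "result": 0},
        {"product": "abc", "test_tstamp": 1530693415, "result": 0},
        {"product": "abc", "test_tstamp": 1530693420, "result": 1},
        {"product": "abc", "test_tstamp": 1530693430, "result": 1},
        {"product": "abc", "test_tstamp": 1530693431, "result": 1}]
df = pd.DataFrame(data)

# Bin edges 20 seconds apart: 1530693398, 1530693418, 1530693438.
edges = np.arange(1530693398, 1530693440, 20)
df["bins"] = pd.cut(df["test_tstamp"], edges)

# Named aggregation: output_column=(input_column, function).
# observed=False is passed explicitly because "bins" is categorical.
summary = df.groupby("bins", observed=False).agg(
    mindate=("test_tstamp", "min"),
    maxdate=("test_tstamp", "max"),
    count=("result", "count"),
)
print(summary)
```

This produces flat columns instead of the two-level header shown above, which is usually easier to combine with other results later.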

I am also able to find the result success and result failed counts using groupby():

output_2 = df.groupby(["bins", "result"]).result.count()
                                     result
 bins                     result        
 (1530693398, 1530693418] 0            3
                          1            2
 (1530693418, 1530693438] 0            3

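I am not sure how to combine output_1 and output_2 so that, instead of the result count column above, I get result success and result failed columns, plus a flag column associated with each bin.

Expected output: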

                         test_tstamp              result          flag
                             maxdate     mindate success failed
bins
(1530693398, 1530693418]  1530693415  1530693399       3      2   True
(1530693418, 1530693438]  1530693431  1530693420       0      3  False

Any pointers would be helpful! Thanks!

1 Answer:

Answer 0: (score: 1)

Unstack output_2, then combine the two outputs:

output_2 = (
    output_2
       .unstack(fill_value=0)
       .rename(columns={0 : 'success', 1 : 'failed'}))

df = (pd.concat([output_1.test_tstamp, output_2], axis=1, keys=['test_tstamp', 'result'])
        .assign(flag=output_2.success.gt(0)))

                         test_tstamp              result          flag
result                       mindate     maxdate success failed       
bins                                                                  
(1530693398, 1530693418]  1530693399  1530693415       3      2   True
(1530693418, 1530693438]  1530693420  1530693431       0      3  False
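
For recent pandas versions (0.25 or later, which support named aggregation), here is a hedged sketch of how the same table might be built in a single groupby, without the intermediate output_1 and output_2. It assumes the second data list from the question (the one whose last three results are 1). The helper columns success and failed and the variable out are names I introduced for illustration, and the result has flat columns rather than the two-level header above.

```python
import numpy as np
import pandas as pd

# Same values as the second data list in the question, written column-wise.
df = pd.DataFrame({
    "product": ["abc"] * 8,
    "test_tstamp": [1530693399, 1530693405, 1530693410, 1530693411,
                    1530693415, 1530693420, 1530693430, 1530693431],
    "result": [1, 0, 1, 0, 0, 1, 1, 1],
})
df["bins"] = pd.cut(df["test_tstamp"], np.arange(1530693398, 1530693440, 20))

# Boolean helper columns: summing them per bin counts successes and failures.
df = df.assign(success=df["result"].eq(0), failed=df["result"].eq(1))

out = df.groupby("bins", observed=False).agg(
    mindate=("test_tstamp", "min"),
    maxdate=("test_tstamp", "max"),
    success=("success", "sum"),
    failed=("failed", "sum"),
)

# A bin is flagged True when it holds at least one successful result.
out["flag"] = out["success"].gt(0)
print(out)
```

Counting through boolean columns means a bin with no successes still gets an explicit 0, so no unstack(fill_value=0) step is needed.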