pandas: group a DataFrame into user-specified time intervals

Time: 2014-06-26 11:06:36

Tags: python pandas

Possibly related: pandas dataframe group year index by decade

For example, suppose I have the following data:

                     status  bytes_sent upstream_cache_status  \
timestamp                                                       
2014-05-26 23:56:30     200         356                  MISS   
2014-05-26 23:56:30     200       10517                     -   
2014-05-26 23:57:05     200        6923                  MISS   
2014-05-26 23:57:14     200         323                     -   
2014-05-26 23:57:30     200         356                  MISS   
2014-05-26 23:57:38     200        8107                   HIT   
2014-05-26 23:57:43     200         369                  MISS   
2014-05-26 23:57:56     304         401                   HIT   
2014-05-26 23:57:56     304         401                   HIT   
2014-05-26 23:57:56     304         387                  MISS   
2014-05-26 23:57:57     304         401                   HIT   
2014-05-26 23:57:58     304         401                   HIT   
2014-05-26 23:58:08     200         507               EXPIRED   
2014-05-26 23:58:29     304         338                   HIT   
2014-05-26 23:58:31     400         409                     -   
2014-05-26 23:58:45     200         425                  MISS   

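How can I group these rows so that each group contains the log entries that fall within the same 30-second window (the window length is user-specified)? I have seen this: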

df.groupby(lambda x: x.hour)

but I very much doubt it is relevant to my case.

1 Answer:

Answer 0: (score: 1)

df.groupby(pd.Grouper(freq='30S', level=0)) should do it; for example:

>>> aggr = lambda df: df.apply(tuple)
>>> df.groupby(pd.Grouper(freq='30S', level=0)).aggregate(aggr)
                                                       status                                 bytes_sent  \
timestamp                                                                                                  
2014-05-26 23:56:30                                (200, 200)                               (356, 10517)   
2014-05-26 23:57:00                                (200, 200)                                (6923, 323)   
2014-05-26 23:57:30  (200, 200, 200, 304, 304, 304, 304, 304)  (356, 8107, 369, 401, 401, 387, 401, 401)   
2014-05-26 23:58:00                                (200, 304)                                 (507, 338)   
2014-05-26 23:58:30                                (400, 200)                                 (409, 425)   

                                           upstream_cache_status  
timestamp                                                         
2014-05-26 23:56:30                                    (MISS, -)  
2014-05-26 23:57:00                                    (MISS, -)  
2014-05-26 23:57:30  (MISS, HIT, MISS, HIT, HIT, MISS, HIT, HIT)  
2014-05-26 23:58:00                               (EXPIRED, HIT)  
2014-05-26 23:58:30                                    (-, MISS)
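
To try the grouping end to end, here is a minimal sketch that rebuilds the sample rows from the question (only the three columns shown there) and takes the window length as a parameter. The helper name group_by_period is hypothetical, not part of pandas. Passing a pd.Timedelta as freq is one way to accept a user-specified number of seconds without building an alias string, and agg(tuple) is used here as a shorter stand-in for the lambda above.

```python
import pandas as pd

# Sample rows copied from the question (only the columns shown there).
rows = [
    ("2014-05-26 23:56:30", 200, 356, "MISS"),
    ("2014-05-26 23:56:30", 200, 10517, "-"),
    ("2014-05-26 23:57:05", 200, 6923, "MISS"),
    ("2014-05-26 23:57:14", 200, 323, "-"),
    ("2014-05-26 23:57:30", 200, 356, "MISS"),
    ("2014-05-26 23:57:38", 200, 8107, "HIT"),
    ("2014-05-26 23:57:43", 200, 369, "MISS"),
    ("2014-05-26 23:57:56", 304, 401, "HIT"),
    ("2014-05-26 23:57:56", 304, 401, "HIT"),
    ("2014-05-26 23:57:56", 304, 387, "MISS"),
    ("2014-05-26 23:57:57", 304, 401, "HIT"),
    ("2014-05-26 23:57:58", 304, 401, "HIT"),
    ("2014-05-26 23:58:08", 200, 507, "EXPIRED"),
    ("2014-05-26 23:58:29", 304, 338, "HIT"),
    ("2014-05-26 23:58:31", 400, 409, "-"),
    ("2014-05-26 23:58:45", 200, 425, "MISS"),
]
df = pd.DataFrame(
    rows,
    columns=["timestamp", "status", "bytes_sent", "upstream_cache_status"],
)
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.set_index("timestamp")


def group_by_period(frame, seconds):
    """Hypothetical helper: group rows into fixed windows of `seconds` seconds.

    Assumes `frame` has a DatetimeIndex (level 0).
    """
    window = pd.Timedelta(seconds=seconds)
    return frame.groupby(pd.Grouper(freq=window, level=0))


grouped = group_by_period(df, 30)

sizes = grouped.size()                      # number of log lines per window
total_bytes = grouped["bytes_sent"].sum()   # an ordinary aggregation per window
as_tuples = grouped.agg(tuple)              # collect each column's values per window

print(sizes)
print(total_bytes)
print(as_tuples)
```

With the default settings the windows are closed on the left and labelled by their start time, so a row stamped 23:57:30 lands in the 23:57:30 window rather than the 23:57:00 one. Changing the second argument (for example group_by_period(df, 60)) is all that is needed to use a different period.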