Aggregating and filling in data for missing dates (days and hours)

Asked: 2013-08-28 02:48:14

Tags: python numpy pandas

Suppose we have a list like this, showing the count of each object at a particular date (mm-dd-yyyy-hour-minute):

A = [
 [
    ['07-07-2012-21-04', 'orange', 1],
    ['08-16-2012-08-57', 'orange', 1],
    ['08-18-2012-03-30', 'orange', 1],
    ['08-18-2012-03-30', 'orange', 1],
    ['08-19-2012-03-58', 'orange', 1],
    ['08-19-2012-03-58', 'orange', 1],
    ['08-19-2012-04-09', 'orange', 1],
    ['08-19-2012-04-09', 'orange', 1],
    ['08-19-2012-05-21', 'orange', 1],
    ['08-19-2012-05-21', 'orange', 1],
    ['08-19-2012-06-03', 'orange', 1],
    ['08-19-2012-07-51', 'orange', 1],
    ['08-19-2012-08-17', 'orange', 1],
    ['08-19-2012-08-17', 'orange', 1]
 ],
 [
    ['07-07-2012-21-04', 'banana', 1]
 ],
 [
    ['07-07-2012-21-04', 'mango', 1],
    ['08-16-2012-08-57', 'mango', 1],
    ['08-18-2012-03-30', 'mango', 1],
    ['08-18-2012-03-30', 'mango', 1],
    ['08-19-2012-03-58', 'mango', 1],
    ['08-19-2012-03-58', 'mango', 1],
    ['08-19-2012-04-09', 'mango', 1],
    ['08-19-2012-04-09', 'mango', 1],
    ['08-19-2012-05-21', 'mango', 1],
    ['08-19-2012-05-21', 'mango', 1],
    ['08-19-2012-06-03', 'mango', 1],
    ['08-19-2012-07-51', 'mango', 1],
    ['08-19-2012-08-17', 'mango', 1],
    ['08-19-2012-08-17', 'mango', 1]
 ]
]

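What I need to do with A is fill in all the missing dates for each object (from the minimum date to the maximum date in A) with a value of 0. Once the missing dates and their corresponding values (0) are in place, I want to sum the values for each date so that no date is repeated within a sub-list.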

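What I have tried so far: I split the dates and values of A into separate lists (named u and v), convert each sub-list into a pandas Series, and assign each one its own index. So for each pair in zip(u, v):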

def generate(values, indices):

    indices = flatten(indices)

    date_index = DatetimeIndex(indices)
    ts = Series(values, index=date_index)

    ts.reindex(date_range(min(date_index), max(date_index)))

    return ts

But here the reindexing raises an exception. What I am looking for is a purely Pythonic way (without pandas), based entirely on list comprehensions or even numpy arrays.

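There is also the problem of hourly aggregation: if all the dates are the same and only the hours differ, I want to fill in all the missing hours of that day with a value of 0 and then repeat the same aggregation process per hour.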

Thanks in advance.

1 Answer:

Answer 0: (score: 2)

How about this:

from collections import defaultdict, OrderedDict                              
from datetime import datetime, timedelta                                      
from itertools import chain, groupby                                          

flat = sorted((datetime.strptime(d, '%m-%d-%Y-%H-%M').date(), f, c)           
              for (d, f, c) in chain(*A))                                     
counts = [(d, f, sum(e[2] for e in l))                                        
          for (d, f), l                                                       
          in groupby(flat, key=lambda t: (t[0], t[1]))]                       

# assume the data is non-empty
start = counts[0][0]                                                          
end = counts[-1][0]                                                           
result = OrderedDict((start+timedelta(days=i), defaultdict(int))             
                     for i in range((end-start).days+1))                      
for day, data in groupby(counts, key=lambda d: d[0]):                         
    result[day].update((f, c) for d, f, c in data)
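
To see what the filled structure looks like, the same steps can be wrapped in a function and run on a small sample. This is only a sketch under stated assumptions: `fill_daily` and `sample` are illustrative names that do not appear in the answer, and the summing step uses a `Counter` instead of `groupby`, so that every name gets an explicit 0 on days where it is missing.

```python
from collections import Counter, OrderedDict
from datetime import datetime, timedelta
from itertools import chain


def fill_daily(groups, fmt='%m-%d-%Y-%H-%M'):
    """Sum counts per (day, name) and add zero entries for missing days.

    `fill_daily` is a hypothetical helper; `groups` has the same shape
    as A in the question. Assumes there is at least one record.
    """
    totals = Counter()
    for stamp, name, count in chain.from_iterable(groups):
        totals[(datetime.strptime(stamp, fmt).date(), name)] += count

    days = [day for day, _ in totals]
    names = sorted({name for _, name in totals})
    start, end = min(days), max(days)

    result = OrderedDict()
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        # A Counter returns 0 for keys it has never seen.
        result[day] = {name: totals[(day, name)] for name in names}
    return result


sample = [
    [['08-16-2012-08-57', 'orange', 1],
     ['08-18-2012-03-30', 'orange', 1],
     ['08-18-2012-03-30', 'orange', 1]],
    [['08-16-2012-08-57', 'banana', 1]],
]
filled = fill_daily(sample)
for day, per_name in filled.items():
    print(day, per_name)
```

Here 08-17 has no records at all, so it shows up with 0 for both names, and the two orange records on 08-18 are summed to 2.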

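My question is: do we really need to fill in the non-existent dates? I can easily imagine cases with a lot of data, even a dangerous amount of it... I think that if you want to list them somewhere, it is better to use a simple generic function and a generator: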

from collections import defaultdict                                           
from datetime import datetime, timedelta                                      
from itertools import chain, groupby                                          

def aggregate(data, resolution='daily'):                                      
    assert resolution in ['hourly', 'daily']                                  
    if resolution == 'hourly':                                                
        round_dt = lambda dt: dt.replace(minute=0, second=0, microsecond=0)   
    else:                                                                     
        round_dt = lambda dt: dt.date()                                       

    flat = sorted((round_dt(datetime.strptime(d, '%m-%d-%Y-%H-%M')), f, c)    
                  for (d, f, c) in chain(*data))
    counts = [(d, f, sum(e[2] for e in l))                                    
              for (d, f), l                                                   
              in groupby(flat, key=lambda t: (t[0], t[1]))]
    result = {}                                                              
    for day, data in groupby(counts, key=lambda d: d[0]):                    
        per_day = result[day] = defaultdict(int)
        per_day.update((f, c) for _, f, c in data)
    return result                                                            

def xaggregate(data, resolution='daily'):                                      
    aggregated = aggregate(data, resolution)                                 
    curr = min(aggregated.keys())                                            
    end = max(aggregated.keys())                                             
    interval = timedelta(days=1) if resolution == 'daily' else timedelta(seconds=3600)
    while curr <= end:
        # None seems a sensible value for missing data
        yield curr, aggregated.get(curr)                   
        curr += interval                                                                                 
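
For the hourly case raised in the question, the same generator idea can yield an explicit 0 instead of None. The following is a minimal sketch, not the answer's own code: `hourly_totals` and `oranges` are hypothetical names, the input is assumed to be one non-empty sub-list of A, and filling runs only from the first to the last observed hour (widen `curr` and `end` to cover a whole day if that is what you need).

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_totals(records, fmt='%m-%d-%Y-%H-%M'):
    """Yield (hour, total) for every hour from the first to the last record.

    Hypothetical helper; hours without any record yield a total of 0.
    """
    totals = Counter()
    for stamp, _name, count in records:
        # The format has no seconds, so zeroing the minutes truncates to the hour.
        hour = datetime.strptime(stamp, fmt).replace(minute=0)
        totals[hour] += count

    curr, end = min(totals), max(totals)
    while curr <= end:
        yield curr, totals[curr]  # missing hours read as 0
        curr += timedelta(hours=1)


oranges = [
    ['08-19-2012-03-58', 'orange', 1],
    ['08-19-2012-03-58', 'orange', 1],
    ['08-19-2012-04-09', 'orange', 1],
    ['08-19-2012-06-03', 'orange', 1],
]
hourly = [(hour.strftime('%H:%M'), total)
          for hour, total in hourly_totals(oranges)]
print(hourly)
```

Because it is a generator, nothing is materialised for the gaps until the caller iterates, which keeps memory use small even when the range between the first and last record is long.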

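In general, my advice is that you should not use a list as a record-like structure (I mean ['07-07-2012-21-04', 'mango', 1]). I think a tuple is better suited to this purpose, and collections.namedtuple is better still.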
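
As an illustration of that suggestion, a row can be converted to a namedtuple once, at parse time. The names `Record`, `parse_record`, and the field names are assumptions made for this sketch; they do not come from the question.

```python
from collections import namedtuple
from datetime import datetime

# `Record` and its field names are illustrative, not from the original post.
Record = namedtuple('Record', ['when', 'name', 'count'])


def parse_record(row, fmt='%m-%d-%Y-%H-%M'):
    """Convert a ['mm-dd-yyyy-HH-MM', name, count] list into a Record."""
    stamp, name, count = row
    return Record(datetime.strptime(stamp, fmt), name, count)


rec = parse_record(['07-07-2012-21-04', 'mango', 1])
print(rec.name, rec.when.date(), rec.count)
```

A `Record` still unpacks and sorts like a plain tuple, so it works with `sorted` and `groupby` as used above, while `rec.when` and `rec.name` read more clearly than `t[0]` and `t[1]`.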