Suppose we have a list like this, showing the count of each object at a particular date (mm-dd-yyyy-hour-minute):
A = [
[
['07-07-2012-21-04', 'orange', 1],
['08-16-2012-08-57', 'orange', 1],
['08-18-2012-03-30', 'orange', 1],
['08-18-2012-03-30', 'orange', 1],
['08-19-2012-03-58', 'orange', 1],
['08-19-2012-03-58', 'orange', 1],
['08-19-2012-04-09', 'orange', 1],
['08-19-2012-04-09', 'orange', 1],
['08-19-2012-05-21', 'orange', 1],
['08-19-2012-05-21', 'orange', 1],
['08-19-2012-06-03', 'orange', 1],
['08-19-2012-07-51', 'orange', 1],
['08-19-2012-08-17', 'orange', 1],
['08-19-2012-08-17', 'orange', 1]
],
[
['07-07-2012-21-04', 'banana', 1]
],
[
['07-07-2012-21-04', 'mango', 1],
['08-16-2012-08-57', 'mango', 1],
['08-18-2012-03-30', 'mango', 1],
['08-18-2012-03-30', 'mango', 1],
['08-19-2012-03-58', 'mango', 1],
['08-19-2012-03-58', 'mango', 1],
['08-19-2012-04-09', 'mango', 1],
['08-19-2012-04-09', 'mango', 1],
['08-19-2012-05-21', 'mango', 1],
['08-19-2012-05-21', 'mango', 1],
['08-19-2012-06-03', 'mango', 1],
['08-19-2012-07-51', 'mango', 1],
['08-19-2012-08-17', 'mango', 1],
['08-19-2012-08-17', 'mango', 1]
]
]
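What I need to do with A is fill in all the missing dates for each object (from the minimum date to the maximum date in A) with a value of 0. Once the missing dates and their corresponding values (0) are filled in, I want to sum the values per date, so that no date is repeated within a sublist.
Here is what I tried: I separated the dates and values of A (into lists named u and v), converted each sublist into a pandas Series, and assigned each its own index. So for each pair in zip(u, v):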
def generate(values, indices):
    indices = flatten(indices)
    date_index = DatetimeIndex(indices)
    ts = Series(values, index=date_index)
    ts.reindex(date_range(min(date_index), max(date_index)))
    return ts
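But here the reindexing raises an exception. What I'm looking for is a purely Pythonic way (without pandas), based entirely on list comprehensions or even numpy arrays.
There is also the question of hourly aggregation: if all the dates are the same and only the hours differ, I want to fill in all the missing hours of that day with 0 and then repeat the same aggregation process per hour.
Thanks in advance.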
Answer 0 (score: 2)
How about this:
from collections import defaultdict, OrderedDict
from datetime import datetime, timedelta
from itertools import chain, groupby

# flatten A, drop the time of day, and sort by (date, name)
flat = sorted((datetime.strptime(d, '%m-%d-%Y-%H-%M').date(), f, c)
              for (d, f, c) in chain(*A))
# sum the counts for each (date, name) pair
counts = [(d, f, sum(e[2] for e in l))
          for (d, f), l
          in groupby(flat, key=lambda t: (t[0], t[1]))]

# let's assume there is at least some data
start = counts[0][0]
end = counts[-1][0]
# one entry per day in the range; missing names default to 0
result = OrderedDict((start + timedelta(days=i), defaultdict(int))
                     for i in range((end - start).days + 1))
for day, data in groupby(counts, key=lambda d: d[0]):
    result[day].update((f, c) for d, f, c in data)
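If what you want in the end is one zero-filled series per object, as the question describes, something along these lines might work. This is only a sketch under a few assumptions: `fill_daily` and `sample` are names I made up, the sample is a cut-down stand-in for A, and the data is assumed to be non-empty.

```python
from collections import Counter
from datetime import datetime, timedelta
from itertools import chain


def fill_daily(data):
    """Return {name: [(date, count), ...]} with one row per day in the range."""
    totals = Counter()
    for stamp, name, count in chain.from_iterable(data):
        day = datetime.strptime(stamp, '%m-%d-%Y-%H-%M').date()
        totals[(day, name)] += count

    days = [day for day, _ in totals]
    start, end = min(days), max(days)  # raises ValueError on empty data
    span = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    names = sorted({name for _, name in totals})
    # a Counter returns 0 for keys it has never seen, which gives the zero fill
    return {name: [(day, totals[(day, name)]) for day in span]
            for name in names}


# hypothetical sample, much shorter than the real A
sample = [
    [['08-16-2012-08-57', 'orange', 1],
     ['08-18-2012-03-30', 'orange', 1],
     ['08-18-2012-03-30', 'orange', 1]],
    [['08-16-2012-08-57', 'banana', 1]],
]
filled = fill_daily(sample)
```

Here every object shares the same date range (the overall minimum to the overall maximum), which is how I read the question.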
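My question is: do we really need to fill in the dates that don't exist? I can easily imagine cases with a lot of data, even a dangerously large amount... I think that if you want to list them somewhere, it's better to use a simple general-purpose function and a generator: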
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import chain, groupby

def aggregate(data, resolution='daily'):
    assert resolution in ['hourly', 'daily']
    if resolution == 'hourly':
        round_dt = lambda dt: dt.replace(minute=0, second=0, microsecond=0)
    else:
        round_dt = lambda dt: dt.date()
    # use the data argument here, not the global A
    flat = sorted((round_dt(datetime.strptime(d, '%m-%d-%Y-%H-%M')), f, c)
                  for (d, f, c) in chain(*data))
    counts = [(d, f, sum(e[2] for e in l))
              for (d, f), l
              in groupby(flat, key=lambda t: (t[0], t[1]))]
    result = {}
    for day, group in groupby(counts, key=lambda d: d[0]):
        per_day = result[day] = defaultdict(int)
        per_day.update((f, c) for _, f, c in group)
    return result

def xaggregate(data, resolution='daily'):
    aggregated = aggregate(data, resolution)
    curr = min(aggregated.keys())
    end = max(aggregated.keys())
    interval = timedelta(days=1) if resolution == 'daily' else timedelta(seconds=3600)
    while curr <= end:
        # None is a sensible value for missing data, I think
        yield curr, aggregated.get(curr)
        curr += interval
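xaggregate yields None for the gaps. If you would rather get 0 for the missing hours, as the question asks, a variation like this might do. It is a simplified sketch for a single object's rows, not a drop-in replacement: `hourly_counts` and `rows` are hypothetical names, and the input is assumed to be non-empty.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_counts(rows):
    """Yield (hour, total) for every hour between the first and last row."""
    totals = Counter()
    for stamp, _name, count in rows:
        dt = datetime.strptime(stamp, '%m-%d-%Y-%H-%M')
        # the format has no seconds, so zeroing the minute rounds to the hour
        totals[dt.replace(minute=0)] += count

    curr, end = min(totals), max(totals)  # raises ValueError on empty input
    while curr <= end:
        # looking up a missing key in a Counter gives 0 without adding the key
        yield curr, totals[curr]
        curr += timedelta(hours=1)


# hypothetical rows for one object on a single day
rows = [
    ['08-19-2012-03-58', 'orange', 1],
    ['08-19-2012-03-58', 'orange', 1],
    ['08-19-2012-04-09', 'orange', 1],
    ['08-19-2012-06-03', 'orange', 1],
]
hourly = list(hourly_counts(rows))
```

Because it is a generator, the filled-in hours are produced one at a time and never have to be stored all at once.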
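In general, my advice is that you shouldn't use lists for fixed-structure records (I mean things like ['07-07-2012-21-04', 'mango', 1]). I think a tuple is better suited to this purpose, and collections.namedtuple is better still.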
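For example, a row could be turned into a named record roughly like this. The names `Record`, `when`, `name`, `count`, and `to_record` are my own picks for illustration; nothing in the question prescribes them.

```python
from collections import namedtuple
from datetime import datetime

# hypothetical field names; choose whatever fits your data
Record = namedtuple('Record', ['when', 'name', 'count'])


def to_record(row):
    """Convert a [timestamp_string, name, count] row into a Record."""
    stamp, name, count = row
    return Record(datetime.strptime(stamp, '%m-%d-%Y-%H-%M'), name, count)


rec = to_record(['07-07-2012-21-04', 'mango', 1])
```

After that you can write rec.when and rec.name instead of row[0] and row[1], and the record still unpacks and sorts like a plain tuple.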