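I have this data structure, where each team has a list of issues with start/end dates.
For each team, I would like to merge issues that have the same key and overlapping dates. In the resulting issue, the start date is the earlier of the start dates and the end date is the later of the end dates.
I tried doing this with a few for loops, but I would like to know what the most Pythonic way to do it is.
I want to merge only issues that are in the same team, have the same key, and have overlapping dates.
The issues are not in chronological order.
Input: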
{
'Team A': [{
'start': '11/Jul/13 1:49 PM',
'end': '10/Oct/13 5:16 PM',
'issue': 'KEY-12678'
}, {
'start': '3/Oct/13 10:40 AM',
'end': '11/Nov/13 1:02 PM',
'issue': 'KEY-12678'
}],
'Team B': [{
'start': '5/Sep/13 3:35 PM',
'end': '08/Nov/13 3:35 PM',
'issue': 'KEY-12679'
}, {
'start': '19/Aug/13 5:05 PM',
'end': '10/Sep/13 5:16 PM',
'issue': 'KEY-12679'
}, {
'start': '09/Jul/13 9:15 AM',
'end': '29/Jul/13 9:15 AM',
'issue': 'KEY-12680'
}]
}
Output:
{
'Team A': [{
'start': '11/Jul/13 1:49 PM',
'end': '11/Nov/13 1:02 PM',
'issue': 'KEY-12678'
}],
'Team B': [{
'start': '19/Aug/13 5:05 PM',
'end': '08/Nov/13 3:35 PM',
'issue': 'KEY-12679'
}, {
'start': '09/Jul/13 9:15 AM',
'end': '29/Jul/13 9:15 AM',
'issue': 'KEY-12680'
}]
}
To parse the dates, here is the date format (to save you a few minutes):
date_format = "%d/%b/%y %I:%M %p"  # %I (12-hour clock) is needed for %p to take effect
Sample data:
d = {
"N/A": [
{'start': '23/Jun/14 8:48 PM', 'end': '01/Aug/14 11:00 PM', 'issue': 'KEY-12157'}
,{'start': '09/Jul/13 1:57 PM', 'end': '29/Jul/13 1:57 PM', 'issue': 'KEY-12173'}
,{'start': '21/Aug/13 12:29 PM', 'end': '02/Dec/13 6:06 PM', 'issue': 'KEY-12173'}
,{'start': '17/Feb/14 3:17 PM', 'end': '18/Feb/14 5:51 PM', 'issue': 'KEY-12173'}
,{'start': '12/May/14 4:42 PM', 'end': '02/Jun/14 4:42 PM', 'issue': 'KEY-12173'}
,{'start': '24/Jun/14 11:33 AM', 'end': '01/Aug/14 11:49 AM', 'issue': 'KEY-12173'}
,{'start': '07/Oct/14 1:17 PM', 'end': '17/Nov/14 10:30 AM', 'issue': 'KEY-12173'}
,{'start': '31/Mar/15 1:58 PM', 'end': '12/May/15 4:26 PM', 'issue': 'KEY-12173'}
,{'start': '15/Jul/14 10:06 AM', 'end': '15/Sep/14 5:25 PM', 'issue': 'KEY-12173'}
,{'start': '06/Jan/15 10:46 AM', 'end': '26/Jan/15 10:46 AM', 'issue': 'KEY-20628'}
,{'start': '18/Nov/14 5:08 PM', 'end': '16/Feb/15 1:31 PM', 'issue': 'KEY-20628'}
,{'start': '02/Oct/13 12:32 PM', 'end': '21/Oct/13 5:32 PM', 'issue': 'KEY-12146'}
,{'start': '11/Mar/14 12:08 PM', 'end': '31/Mar/14 12:08 PM', 'issue': 'KEY-12681'}
]}
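For reference, a minimal sketch of parsing these timestamps with the standard library. It assumes English month abbreviations and uses %I (12-hour clock), because %p only affects the parsed hour when the hour is read with %I. The helper name parse_ts is hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%d/%b/%y %I:%M %p"  # %I = 12-hour clock, so AM/PM is applied


def parse_ts(text):
    """Parse a timestamp such as '11/Jul/13 1:49 PM' (hypothetical helper)."""
    return datetime.strptime(text, DATE_FORMAT)


afternoon = parse_ts("11/Jul/13 1:49 PM")
print(afternoon.hour)  # 13

# With %H the AM/PM marker is matched but ignored, so the hour stays 1.
wrong = datetime.strptime("11/Jul/13 1:49 PM", "%d/%b/%y %H:%M %p")
print(wrong.hour)  # 1
```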
Answer 0 (score: 0)
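You can use the '%d/%b/%y %I:%M %p' format with datetime.strptime to convert the date strings into Python datetime objects, and itertools.groupby to group the sub-dictionaries by their issue key. Then loop over the zipped groups and use min and max with a suitable key function to extract the earliest start and the latest end:
from datetime import datetime
from itertools import groupby
from operator import itemgetter

date_format = '%d/%b/%y %I:%M %p'  # %I (12-hour clock) is needed for %p to take effect

def parse(s):
    return datetime.strptime(s, date_format)

# d is the input dictionary from the question.
# Note: this merges all entries that share an issue key;
# it does not check that their date ranges overlap.
new = {}
for key in d:
    # groupby only groups adjacent items, so sort by issue first
    by_issue = sorted(d[key], key=itemgetter('issue'))
    for dic in [zip(*[i.items() for i in g]) for _, g in groupby(by_issue, itemgetter('issue'))]:
        temp = {}
        # p holds the field name repeated per entry, t the corresponding values
        for p, t in [zip(*tup) for tup in dic]:
            val = p[0]
            if val == 'start':
                temp[val] = min(t, key=parse)
            elif val == 'end':
                temp[val] = max(t, key=parse)
            else:
                temp[val] = t[0]
        new.setdefault(key, []).append(temp)
print(new)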
Result:
{'Team A': [{'start': '11/Jul/13 1:49 PM', 'end': '11/Nov/13 1:02 PM', 'issue': 'KEY-12678'}],
'Team B': [{'start': '19/Aug/13 5:05 PM', 'end': '08/Nov/13 3:35 PM', 'issue': 'KEY-12679'},
{'start': '09/Jul/13 9:15 AM', 'end': '29/Jul/13 9:15 AM', 'issue': 'KEY-12680'}]}
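The snippet above collapses every entry that shares an issue key, whether or not the date ranges overlap. A minimal sketch of a sort-and-sweep variant that merges only overlapping ranges follows. It assumes 12-hour timestamps and treats ranges that merely touch (one starts exactly when the other ends) as overlapping. The names parse_ts, merge_overlapping and merge_teams are hypothetical.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

DATE_FORMAT = "%d/%b/%y %I:%M %p"  # assumption: 12-hour timestamps, English month names


def parse_ts(text):
    return datetime.strptime(text, DATE_FORMAT)


def merge_overlapping(issues):
    """Merge entries that share an issue key and whose date ranges overlap."""
    # Sort by issue, then by parsed start, so overlapping ranges become adjacent.
    ordered = sorted(issues, key=lambda e: (e["issue"], parse_ts(e["start"])))
    merged = []
    for _, group in groupby(ordered, key=itemgetter("issue")):
        current = None
        for entry in group:
            if current is not None and parse_ts(entry["start"]) <= parse_ts(current["end"]):
                # Overlap: keep the later of the two end dates.
                if parse_ts(entry["end"]) > parse_ts(current["end"]):
                    current["end"] = entry["end"]
            else:
                current = dict(entry)  # copy, so the input is not mutated
                merged.append(current)
    return merged


def merge_teams(teams):
    return {team: merge_overlapping(issues) for team, issues in teams.items()}


sample = {
    "Team A": [
        {"start": "11/Jul/13 1:49 PM", "end": "10/Oct/13 5:16 PM", "issue": "KEY-12678"},
        {"start": "3/Oct/13 10:40 AM", "end": "11/Nov/13 1:02 PM", "issue": "KEY-12678"},
    ],
}
print(merge_teams(sample))
```

Because the entries are sorted by start date within each issue, a single pass that compares each entry with the running merged range is enough, including for chains of three or more overlapping ranges.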
Answer 1 (score: 0)
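Here is my current code, which seems to work (it is a bit tricky to check).
In my code I use the names epic and mr: each row in the sample data is an epic, and its issue key is the mr.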
from datetime import datetime

date_format = "%d/%b/%y %I:%M %p"  # %I (12-hour clock) is needed for %p to take effect
# d = {"team": [... sample data ...]}  # input dictionary, as in the question

def get_list_of_mrs(epics):
    mrs = set()
    for epic in epics:
        mrs.add(epic['issue'])
    return mrs

def is_overlap(epic1, epic2):
    start1 = datetime.strptime(epic1['start'], date_format)
    end1 = datetime.strptime(epic1['end'], date_format)
    start2 = datetime.strptime(epic2['start'], date_format)
    end2 = datetime.strptime(epic2['end'], date_format)
    return (start1 <= end2) and (end1 >= start2)

def get_overlapping_dates(epic1, epic2):
    start1 = datetime.strptime(epic1['start'], date_format)
    end1 = datetime.strptime(epic1['end'], date_format)
    start2 = datetime.strptime(epic2['start'], date_format)
    end2 = datetime.strptime(epic2['end'], date_format)
    return (min(start1, start2), max(end1, end2))

def remove_overlaps(epics):
    filtered_epics = []
    for epic in epics:
        for temp_epic in epics:
            if temp_epic is epic:
                continue  # skip the epic itself
            if 'overlap' in epic:
                continue  # this epic was already absorbed into an earlier one
            if is_overlap(epic, temp_epic):
                # absorb temp_epic into epic and mark it for removal
                temp_epic['overlap'] = True
                new_start, new_end = get_overlapping_dates(epic, temp_epic)
                epic['start'] = new_start.strftime(date_format)
                epic['end'] = new_end.strftime(date_format)
        filtered_epics.append(epic)
    return [x for x in filtered_epics if 'overlap' not in x]

for team in d:
    epics = d[team]
    epics.sort(key=lambda x: datetime.strptime(x['start'], date_format))
    uniq_mrs_in_team = get_list_of_mrs(epics)
    filtered_mrs = []
    for mr in uniq_mrs_in_team:
        mr_epics = [x for x in epics if x['issue'] == mr]
        filtered = remove_overlaps(mr_epics)
        filtered_mrs.extend(filtered)
    d[team] = filtered_mrs
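Since the result is tricky to check by eye, a small brute-force checker can help. This is a minimal sketch, assuming the same dict-of-lists layout and 12-hour timestamps; find_remaining_overlaps is a hypothetical helper and not part of the code above. It returns every pair of entries in the same team with the same issue key whose date ranges still overlap, so an empty list means the merge left nothing behind.

```python
from datetime import datetime
from itertools import combinations

DATE_FORMAT = "%d/%b/%y %I:%M %p"  # assumption: 12-hour timestamps


def find_remaining_overlaps(teams):
    """Return (team, entry1, entry2) for each same-issue pair that still overlaps."""
    def span(entry):
        return (datetime.strptime(entry["start"], DATE_FORMAT),
                datetime.strptime(entry["end"], DATE_FORMAT))

    problems = []
    for team, entries in teams.items():
        for a, b in combinations(entries, 2):
            if a["issue"] != b["issue"]:
                continue
            (start_a, end_a), (start_b, end_b) = span(a), span(b)
            if start_a <= end_b and start_b <= end_a:
                problems.append((team, a, b))
    return problems


merged = {
    "Team A": [
        {"start": "11/Jul/13 1:49 PM", "end": "11/Nov/13 1:02 PM", "issue": "KEY-12678"},
    ],
}
print(find_remaining_overlaps(merged))  # []
```

The pairwise comparison is quadratic in the number of entries per team, which is acceptable for a sanity check on data of this size.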
Answer 2 (score: 0)
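I suggest a pandas solution, as hinted at by aquavitae in the comments: load the records into a DataFrame, sort them by issue and start date, then merge each row into the previous one when both have the same issue and their dates overlap.
It looks like this: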
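import pandas as pd
import numpy as np

date_format = '%d/%b/%y %I:%M %p'  # %I (12-hour clock) is needed for %p to take effect
df = pd.DataFrame(d['N/A'])
df['end'] = pd.to_datetime(df['end'], format=date_format)
df['start'] = pd.to_datetime(df['start'], format=date_format)
# DataFrame.sort was removed from pandas; sort_values replaces it
df = df.sort_values(['issue', 'start']).reset_index(drop=True)

# Compare each row with the next one. Using .values compares by position;
# comparing the sliced Series directly would try to align them by index label.
# Only neighbouring rows are compared, so a range nested inside an earlier,
# longer one can be missed.
time_overlaps = df['end'].values[:-1] > df['start'].values[1:]
same_issue = df['issue'].values[:-1] == df['issue'].values[1:]
rows_to_drop = np.logical_and(time_overlaps, same_issue)
rows_to_drop_indices = [i + 1 for i, j in enumerate(rows_to_drop) if j]

# Walk backwards so that a chain of overlapping rows passes its latest end
# up to the first row of the chain
for i in reversed(rows_to_drop_indices):
    df.loc[i - 1, 'end'] = max(df.loc[i - 1, 'end'], df.loc[i, 'end'])
df.drop(rows_to_drop_indices, inplace=True)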
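If you do not want to keep the DataFrame object and would rather continue in the format you specified in the question (the start and end values will be pandas Timestamp objects rather than strings), use: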
df.to_dict('records')
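The approach above compares each row only with its immediate neighbour. A minimal sketch of a variant that tracks the running latest end per issue follows, so nested ranges and longer chains are merged as well, and the result is converted back to strings. It assumes 12-hour timestamps; merge_with_pandas and the block column are hypothetical names. Note that strftime zero-pads the day and hour, so '1:49 PM' comes back as '01:49 PM'.

```python
import pandas as pd

DATE_FORMAT = "%d/%b/%y %I:%M %p"  # assumption: 12-hour timestamps, English month names


def merge_with_pandas(records):
    """Merge same-issue records whose date ranges overlap (hypothetical helper)."""
    df = pd.DataFrame(records)
    for col in ("start", "end"):
        df[col] = pd.to_datetime(df[col], format=DATE_FORMAT)
    df = df.sort_values(["issue", "start"]).reset_index(drop=True)

    # Latest end among the earlier rows of the same issue (NaT for the first row).
    running_end = df.groupby("issue")["end"].cummax()
    prev_end = running_end.groupby(df["issue"]).shift()
    # A row opens a new block unless it starts before that running end.
    # Comparisons with NaT are False, so the first row of an issue always opens one.
    new_block = ~(df["start"] <= prev_end)
    df["block"] = new_block.cumsum()

    merged = (
        df.groupby("block", sort=False)
        .agg(issue=("issue", "first"), start=("start", "min"), end=("end", "max"))
        .reset_index(drop=True)
    )
    for col in ("start", "end"):
        merged[col] = merged[col].dt.strftime(DATE_FORMAT)
    return merged.to_dict("records")


records = [
    {"start": "5/Sep/13 3:35 PM", "end": "08/Nov/13 3:35 PM", "issue": "KEY-12679"},
    {"start": "19/Aug/13 5:05 PM", "end": "10/Sep/13 5:16 PM", "issue": "KEY-12679"},
    {"start": "09/Jul/13 9:15 AM", "end": "29/Jul/13 9:15 AM", "issue": "KEY-12680"},
]
print(merge_with_pandas(records))
```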
Edit: found an efficient way!