I have a list of log entries that looks like this:
a = [
    {'log': 'abc', 'time': 0},
    {'log': '123', 'time': 1},
    {'log': 'def', 'time': 2},
    {'log': 'abc', 'time': 2},
    {'log': 'ghi', 'time': 3},
    {'log': 'def', 'time': 3}
]
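Times are accurate to the second, but events marked as simultaneous may have occurred in any order relative to each other. For example, in the list above, a[5] may have occurred chronologically before a[4].
Now say I have a sequence of logs that I want to match against a: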
b = [
    {'log': 'abc', 'time': 0},
    {'log': 'def', 'time': 1},
    {'log': 'ghi', 'time': 2}
]
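I want to find the ordered subset of b among the log entries of a where subset[0]['time'] is as close as possible to subset[-1]['time'] (in other words, where the duration spanned by the subset is as short as possible):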
>>> f(a, b)
[{'log': 'abc', 'time': 2}, {'log': 'ghi', 'time': 3}, {'log': 'def', 'time': 3}]
Edit to clarify further:
If the subsets of a that match b are:
# a[0], a[4], a[5]
a1 = [
    {'log': 'abc', 'time': 0},
    {'log': 'ghi', 'time': 3},
    {'log': 'def', 'time': 3}
]
# a[3], a[4], a[5]
a2 = [
    {'log': 'abc', 'time': 2},
    {'log': 'ghi', 'time': 3},
    {'log': 'def', 'time': 3}
]
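then the entries in a1 span 3 seconds, while the entries in a2 span 1 second. Because the entries in a2 cover a shorter duration than those in a1, a2 is the desired return value.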
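To make the arithmetic concrete, here is a minimal sketch of the duration being compared. The helper name `span` is hypothetical and not part of the question; it simply takes the difference between the latest and earliest timestamps.

```python
def span(entries):
    """Seconds between the earliest and the latest entry (hypothetical helper)."""
    times = [entry['time'] for entry in entries]
    return max(times) - min(times)

a1 = [
    {'log': 'abc', 'time': 0},
    {'log': 'ghi', 'time': 3},
    {'log': 'def', 'time': 3},
]
a2 = [
    {'log': 'abc', 'time': 2},
    {'log': 'ghi', 'time': 3},
    {'log': 'def', 'time': 3},
]

print(span(a1), span(a2))  # 3 1
```

Using max and min rather than the last and first elements means the result does not depend on how same-second entries happen to be ordered.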
Answer 0 (score: 1)
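If I understand the problem correctly, this solution works for the sample data provided.
The overall approach is:
Find the matches.
Look for repeats.
Check whether putting each repeat back into the original matches reduces the elapsed time.
Repeat this until there are no repeats left and the elapsed time is no greater than that of the original list of matches.
It is a bit convoluted and probably inefficient for larger problems, but hopefully the comments help point you in the right direction.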
# -*- coding: UTF-8 -*-
from collections import Counter
import copy

def drop_repeated_logs(list_of_dicts):
    """
    drop the repeated text in logs and compute a new time range
    if the new elapsed time range is lower, then return that list of dictionaries
    """
    only_logs = [d['log'] for d in list_of_dicts]
    original_range = list_of_dicts[-1]['time'] - list_of_dicts[0]['time']
    counts = Counter(only_logs)
    original_max_count = counts[max(counts, key=lambda i: counts[i])]
    original_len = len(list_of_dicts)
    print(counts)
    for log_txt in only_logs:
        num_occ = counts[log_txt]
        if num_occ > 1:
            # list of matching log subsets without repeats
            new_d = [entry for entry in list_of_dicts if entry['log'] != log_txt]
            print(new_d)
            # repeating log subset entries
            entries_to_try = [entry for entry in list_of_dicts if entry['log'] == log_txt]
            print(entries_to_try)
            for repeat in entries_to_try:
                temp_d_list = copy.copy(new_d)
                # add one of the repeated entries to the matches
                temp_d_list.append(repeat)
                newly_sorted = sorted(temp_d_list, key=lambda k: k["time"])
                # check what the new "time elapsed" is
                new_range = newly_sorted[-1]['time'] - newly_sorted[0]['time']
                print("Newly computed range of {}: {}\n".format(newly_sorted, new_range))
                new_len = len(newly_sorted)
                # we should return an updated list if the range is lower or we were able to get one repeated entry out
                # see if the new time elapsed is an improvement from the original
                if new_range < original_range:
                    print("Found a smaller range, returning: {}".format(new_range))
                    return (new_range, newly_sorted)
            # no repeat shortened the range; still drop the duplicate, keeping the last entry tried
            if new_range == original_range and new_len < original_len:
                print("The range is unchanged, but got rid of a duplicate log text")
                return (new_range, newly_sorted)
    return original_range, list_of_dicts
b = [
    {'log': 'abc', 'time': 0},
    {'log': 'def', 'time': 1},
    {'log': 'ghi', 'time': 2}
]
a = [
    {'log': 'abc', 'time': 0},
    {'log': '123', 'time': 1},
    {'log': 'def', 'time': 2},
    {'log': 'abc', 'time': 2},
    {'log': 'ghi', 'time': 3},
    {'log': 'def', 'time': 3}
]
a_logs = [d['log'] for d in a]
b_logs = [d['log'] for d in b]

def intersection(a, b):
    return list(set(a) & set(b))

# keep only the entries of a whose log text also appears in b
logs_of_interest = intersection(a_logs, b_logs)
matches_in_a = [entry for entry in a if entry['log'] in logs_of_interest]
sorted_matches = sorted(matches_in_a, key=lambda k: k['time'])
print(sorted_matches)
rnge = sorted_matches[-1]['time'] - sorted_matches[0]['time']
sorted_logs = [d['log'] for d in sorted_matches]
log_counts = Counter(sorted_logs)
max_count = log_counts[max(log_counts, key=lambda i: log_counts[i])]
print("max count: {}".format(max_count))
# initialize a lower range to get the while loop going
lower_range = rnge + 1
while lower_range > rnge or max_count > 1:
    lower_range, sorted_matches = drop_repeated_logs(sorted_matches)
    sorted_logs = [d['log'] for d in sorted_matches]
    log_counts = Counter(sorted_logs)
    print("log counts: {}".format(log_counts))
    max_count = log_counts[max(log_counts, key=lambda i: log_counts[i])]
    print("MAX COUNT: {}".format(max_count))
    print("NEW LOWER RANGE: {}".format(lower_range))
print("FINAL ANSWER: range: {}; {}".format(lower_range, sorted_matches))
> [{'log': 'abc', 'time': 2}, {'log': 'ghi', 'time': 3}, {'log': 'def', 'time': 3}]
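Since the approach above may be slow on larger inputs, here is a sketch of an alternative, not the method used in this answer: a sliding window over the time-sorted matches. It assumes the log texts in b are distinct and that only the log texts of b matter, not the order or timestamps in b. The name `smallest_window` is hypothetical.

```python
from collections import Counter

def smallest_window(a, b):
    """Return (span, entries): the shortest time window of a covering every log text in b."""
    wanted = {entry['log'] for entry in b}  # assumption: distinct log texts in b
    # Keep only the relevant entries of a, ordered by time.
    events = sorted((e for e in a if e['log'] in wanted), key=lambda e: e['time'])
    in_window = Counter()  # occurrences of each wanted log in events[left:right + 1]
    best = None            # (span, left, right) of the best window seen so far
    left = 0
    for right, event in enumerate(events):
        in_window[event['log']] += 1
        # Shrink from the left while the window still covers every wanted log.
        while len(in_window) == len(wanted):
            span = event['time'] - events[left]['time']
            if best is None or span < best[0]:
                best = (span, left, right)
            first = events[left]['log']
            in_window[first] -= 1
            if in_window[first] == 0:
                del in_window[first]
            left += 1
    if best is None:
        return None  # some log text of b never occurs in a
    span, left, right = best
    # Inside the best window, keep the latest occurrence of each log text.
    chosen = {}
    for e in events[left:right + 1]:
        chosen[e['log']] = e
    return span, sorted(chosen.values(), key=lambda e: e['time'])

a = [
    {'log': 'abc', 'time': 0},
    {'log': '123', 'time': 1},
    {'log': 'def', 'time': 2},
    {'log': 'abc', 'time': 2},
    {'log': 'ghi', 'time': 3},
    {'log': 'def', 'time': 3},
]
b = [
    {'log': 'abc', 'time': 0},
    {'log': 'def', 'time': 1},
    {'log': 'ghi', 'time': 2},
]
result = smallest_window(a, b)
print(result)
```

On the sample data this finds a span of 1 second. When several subsets tie for the shortest span (here 'def' at time 2 and 'def' at time 3 both work), the sketch may pick a different entry than the answer's code does.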