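I have two lists of dictionaries. Both contain data items along with start and stop timestamps. The first list contains dictionaries that represent observations of text sequences with start and stop times. It looks like this: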
list_1 = [
{'word': 'hey hows it going?', 's1': 1.2, 's2': 3.6},
{'word': 'um', 's1': 3.7, 's2': 4.2},
{'word': 'its raining outside today', 's1': 4.3, 's2': 5.0},
{'word': 'and its really cold', 's1': 5.1, 's2': 6.6},
{'word': 'dont you think?', 's1': 6.7, 's2': 8.1},
{'word': 'its awful', 's1': 7.7, 's2': 9.0}
]
The second list contains dictionaries that represent observations of categories with start and stop times. It looks like this:
list_2 = [
{'category': 0, 's1': 0.0, 's2': 3.8},
{'category': 1, 's1': 3.9, 's2': 4.9},
{'category': 1, 's1': 5.0, 's2': 7.2},
{'category': 0, 's1': 7.3, 's2': 7.6},
{'category': 1, 's1': 7.7, 's2': 9.0}
]
I want to create a new item in the dictionaries of list_2, using the values in list_1['word'], according to the following logic:
If the value in list_1['s1'] is greater than the value in list_2['s1'] and less than the value in list_2['s2'], append the value in list_1['word'] to the new item list_2['word'].
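If the value in list_1['s1'] is greater than the value in list_2['s1'] and less than the value in list_2['s2'], but list_1['s2'] is greater than the value in list_2['s2'], append the value in list_1['word'] to the new item list_2['word'] of the NEXT dictionary.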
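Another way to think about it is to iterate over list_1 and list_2:
If the timestamps of a list_1 item fall within the timestamps of a list_2 item, add the list_1 word to a new key-value pair in that list_2 item.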
If the timestamps of a list_1 item do not fall within the timestamps of a single list_2 item, for example it starts in list_2[0] but stops in list_2[1], add the list_1['word'] to list_2[1].
It should look like this:
Answer 0 (score: 1)
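Your original algorithm says "NEXT"; are you sure that is what you want? I tried to implement what you described, but it is not clear what should happen when a phrase overlaps more than two speakers.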
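To make that ambiguity concrete, here is a minimal sketch that lists every speaker interval a phrase touches. The helper name `overlapping_speakers` is hypothetical (it is not part of the question), and it assumes closed intervals and the `s1`/`s2` keys from your data.

```python
def overlapping_speakers(phrase, speakers):
    """Return the indices of every speaker interval that overlaps the phrase.

    Assumption: both intervals are closed, so touching endpoints count.
    """
    return [
        i for i, sp in enumerate(speakers)
        if phrase['s1'] <= sp['s2'] and sp['s1'] <= phrase['s2']
    ]


# The speaker intervals (list_2) and one phrase (from list_1) in the question.
speakers = [
    {'category': 0, 's1': 0.0, 's2': 3.8},
    {'category': 1, 's1': 3.9, 's2': 4.9},
    {'category': 1, 's1': 5.0, 's2': 7.2},
    {'category': 0, 's1': 7.3, 's2': 7.6},
    {'category': 1, 's1': 7.7, 's2': 9.0},
]
phrase = {'word': 'dont you think?', 's1': 6.7, 's2': 8.1}

print(overlapping_speakers(phrase, speakers))  # [2, 3, 4]
```

With your data, 'dont you think?' touches three speaker intervals, so "the NEXT dictionary" is only one of several possible targets.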
Some design notes:
If you used half-open intervals [a, b) rather than closed intervals [a, b], your data would make more sense: where should 3.65 go?
START, END = 's1', 's2'
def require_speaker(start, end):
    ''' Return the latest speaker in start <= time <= end '''
    # This should be an interval tree if your data is large
    # https://en.wikipedia.org/wiki/Interval_tree

    # Exactly one of the first 3 is true, so we could use an `else`,
    # listing all for clarity.
    after = lambda v: v[END] < start    # phrase lies entirely after speaker v
    overlaps = lambda v: start <= v[END] and v[START] <= end
    before = lambda v: end < v[START]   # phrase lies entirely before speaker v
    contained = lambda v: v[START] <= start and end <= v[END]

    take_next = False
    for speaker in list_2:
        if take_next:
            return speaker
        if after(speaker):
            continue
        elif contained(speaker):
            return speaker
        elif overlaps(speaker):
            take_next = True
        elif before(speaker):
            break  # Missed it somehow (can't happen if full coverage)
    raise LookupError('no speaker in range %s - %s' % (start, end))
# Prepare a list for phrases
for speakers in list_2:
    speakers['words'] = []

# Populate phrases for each speaker
for phrase in list_1:
    speaker = require_speaker(phrase[START], phrase[END])
    speaker['words'].append(phrase['word'])

# Convert back to string
for speakers in list_2:
    speakers['words'] = ' '.join(speakers['words'])
With your data
list_1 = [
{'word': 'hey hows it going?', 's1': 1.2, 's2': 3.6},
{'word': 'um', 's1': 3.7, 's2': 4.2},
{'word': 'its raining outside today', 's1': 4.3, 's2': 5.0},
{'word': 'and its really cold', 's1': 5.1, 's2': 6.6},
{'word': 'dont you think?', 's1': 6.7, 's2': 8.1},
{'word': 'its awful', 's1': 7.7, 's2': 9.0}
]
list_2 = [
{'category': 0, 's1': 0.0, 's2': 3.8},
{'category': 1, 's1': 3.9, 's2': 4.9},
{'category': 1, 's1': 5.0, 's2': 7.2},
{'category': 0, 's1': 7.3, 's2': 7.6},
{'category': 1, 's1': 7.7, 's2': 9.0}
]
you get
>>> import pprint
>>> pprint.pprint(list_2)
[{'category': 0, 's1': 0.0, 's2': 3.8, 'words': 'hey hows it going?'},
 {'category': 1, 's1': 3.9, 's2': 4.9, 'words': 'um'},
 {'category': 1,
  's1': 5.0,
  's2': 7.2,
  'words': 'its raining outside today and its really cold'},
 {'category': 0, 's1': 7.3, 's2': 7.6, 'words': 'dont you think?'},
 {'category': 1, 's1': 7.7, 's2': 9.0, 'words': 'its awful'}]
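Coming back to the design note about half-open intervals, here is a minimal sketch of what a lookup could look like if each speaker interval were stored as [s1, s2). The adjusted `s2` values below are an assumption of mine (each interval is stretched to end where the next one begins), and `speaker_at` is a hypothetical helper, not something from the question. It uses the standard library's `bisect_right` on the sorted start times, which is enough when the intervals do not overlap each other.

```python
from bisect import bisect_right

# Assumption: the question's speaker intervals, rewritten as half-open
# [s1, s2) so that every interval ends exactly where the next one starts.
speakers = [
    {'category': 0, 's1': 0.0, 's2': 3.9},
    {'category': 1, 's1': 3.9, 's2': 5.0},
    {'category': 1, 's1': 5.0, 's2': 7.3},
    {'category': 0, 's1': 7.3, 's2': 7.7},
    {'category': 1, 's1': 7.7, 's2': 9.0},
]
# Start times must be sorted for bisect to work.
starts = [sp['s1'] for sp in speakers]


def speaker_at(t):
    """Return the speaker whose half-open interval [s1, s2) contains t."""
    i = bisect_right(starts, t) - 1  # last interval starting at or before t
    if i < 0 or t >= speakers[i]['s2']:
        raise LookupError('no speaker at time %s' % t)
    return speakers[i]


print(speaker_at(3.65)['category'])  # 0
```

Under this layout a time such as 3.65 has exactly one owner, and a boundary value such as 3.9 belongs to the interval that starts there.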
Note that your expected output does not match your algorithm: