Question

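I have two lists of dictionaries. Both contain data items along with start and stop timestamps. The first list contains dictionaries that represent observations of text sequences with start and stop times. It looks like this: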

list_1 = [
      {'word': 'hey hows it going?', 's1': 1.2, 's2': 3.6},
      {'word': 'um', 's1': 3.7, 's2': 4.2},
      {'word': 'its raining outside today', 's1': 4.3, 's2': 5.0},
      {'word': 'and its really cold', 's1': 5.1, 's2': 6.6},
      {'word': 'dont you think?', 's1': 6.7, 's2': 8.1},
      {'word': 'its awful', 's1': 7.7, 's2': 9.0}
    ]

The second list contains dictionaries that represent observations of categories with start and stop times. It looks like this:

list_2 = [
  {'category': 0, 's1': 0.0, 's2': 3.8},
  {'category': 1, 's1': 3.9, 's2': 4.9},
  {'category': 1, 's1': 5.0, 's2': 7.2},
  {'category': 0, 's1': 7.3, 's2': 7.6},
  {'category': 1, 's1': 7.7, 's2': 9.0}
]

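I want to create a new item in the list_2 dictionaries from the list_1['word'] values, according to the following logic:

If the value in list_1['s1'] is greater than the value in list_2['s1'] and less than the value in list_2['s2'], append all values in list_1['word'] to a new item, list_2['word'].
If the value in list_1['s1'] is greater than the value in list_2['s1'] and less than the value in list_2['s2'], but list_1['s2'] is greater than the value in list_2['s1'], append all values in list_1['word'] to a new item list_2['word'] for the NEXT dictionary.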

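Another way to think about it is to loop over list_1 and list_2:

If the timestamps from a list_1 item fall within the timestamps of a list_2 item, add the list_1 words to a new key-value pair in list_2.
If the timestamps from a list_1 item do not fall within the timestamps of a list_2 item, for example it "starts" in list_2[0] and {{ 1}}, then add the list_1['words'] in list_2[1] to list_1[0].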

It should look like this:

list_2[1]

Answer 1

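Your original algorithm says "NEXT"; are you sure that is what you want? I tried to implement what you described, but it is unclear what should happen when a phrase overlaps more than two speakers.

Some design considerations:

Your data would make more sense if the bounds were [a, b) rather than [a, b] - where should 3.65 go?
Storing the values as a list (or ordered by start time, to determine insertion order) rather than flattening them into a space-separated string is probably more reusable. You can always flatten them later.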
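To illustrate the first point, here is a minimal sketch of a half-open lookup. The `speakers` list and the `speaker_at` helper are hypothetical and not part of the question's data: the boundaries are adjusted so that each interval ends exactly where the next begins. In the question's list_2, a time such as 3.85 falls between 3.8 and 3.9 and belongs to no interval; with [s1, s2) bounds every timestamp maps to exactly one speaker. The sketch assumes the intervals are sorted and do not overlap.

```python
from bisect import bisect_right

# Hypothetical half-open speaker intervals [s1, s2): each interval ends
# exactly where the next one begins, so no timestamp falls into a gap.
speakers = [
    {'category': 0, 's1': 0.0, 's2': 3.9},
    {'category': 1, 's1': 3.9, 's2': 5.0},
    {'category': 1, 's1': 5.0, 's2': 7.3},
]
starts = [s['s1'] for s in speakers]  # assumed sorted ascending


def speaker_at(t):
    """Return the speaker whose half-open interval [s1, s2) contains t."""
    i = bisect_right(starts, t) - 1  # last interval starting at or before t
    if i < 0 or t >= speakers[i]['s2']:
        raise LookupError('no speaker at %s' % t)
    return speakers[i]
```

A boundary value such as 3.9 belongs to the interval that starts there, not to the one that ends there, so there is no ambiguity at the edges.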

START, END = 's1', 's2'

def require_speaker(start, end):
    ''' Return the speaker for the phrase spanning start <= time <= end:
        the speaker whose range contains it, or the NEXT speaker when
        the phrase only partially overlaps a speaker. '''
    # This should be an interval tree if your data is large
    # https://en.wikipedia.org/wiki/Interval_tree

    # Each predicate compares the phrase (start, end) with a speaker v.
    # Exactly one of after/overlaps/before is true, so we could use an
    # `else`; listing all for clarity.
    after = lambda v: v[END] < start      # phrase starts after v ends
    overlaps = lambda v: start <= v[END] and v[START] <= end
    before = lambda v: end < v[START]     # phrase ends before v starts
    contained = lambda v: v[START] <= start and end <= v[END]

    take_next = False
    for speaker in list_2:
        if take_next:
            return speaker
        if after(speaker):
            continue
        elif contained(speaker):
            return speaker
        elif overlaps(speaker):
            take_next = True
        elif before(speaker):
            break  # Missed it somehow (can't happen if full coverage)
    raise LookupError('no speaker in range %s - %s' % (start, end))

# Prepare a list for phrases
for speakers in list_2:
    speakers['words'] = []
# Populate phrases for each speaker
for phrase in list_1:
    speaker = require_speaker(phrase[START], phrase[END])
    speaker['words'].append(phrase['word'])
# Convert back to string
for speakers in list_2:
    speakers['words'] = ' '.join(speakers['words'])

With your data

list_1 = [
      {'word': 'hey hows it going?', 's1': 1.2, 's2': 3.6},
      {'word': 'um', 's1': 3.7, 's2': 4.2},
      {'word': 'its raining outside today', 's1': 4.3, 's2': 5.0},
      {'word': 'and its really cold', 's1': 5.1, 's2': 6.6},
      {'word': 'dont you think?', 's1': 6.7, 's2': 8.1},
      {'word': 'its awful', 's1': 7.7, 's2': 9.0}
    ]

list_2 = [
  {'category': 0, 's1': 0.0, 's2': 3.8},
  {'category': 1, 's1': 3.9, 's2': 4.9},
  {'category': 1, 's1': 5.0, 's2': 7.2},
  {'category': 0, 's1': 7.3, 's2': 7.6},
  {'category': 1, 's1': 7.7, 's2': 9.0}
]

you get

>>> import pprint
>>> pprint.pprint(list_2)
[{'category': 0, 's1': 0.0, 's2': 3.8, 'words': 'hey hows it going?'},
 {'category': 1, 's1': 3.9, 's2': 4.9, 'words': 'um'},
 {'category': 1,
  's1': 5.0,
  's2': 7.2,
  'words': 'its raining outside today and its really cold'},
 {'category': 0, 's1': 7.3, 's2': 7.6, 'words': 'dont you think?'},
 {'category': 1, 's1': 7.7, 's2': 9.0, 'words': 'its awful'}]

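Note that your expected output does not match your algorithm:

'um' (3.7-4.2) should go in the 3.9-4.9 range
'its raining outside today' (4.3-5.0) should go in the 5.0-7.2 range
'dont you think?' (6.7-8.1) should go in the 7.7-9.0 range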
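Since the "NEXT" rule is ambiguous when a phrase overlaps more than two speakers, one alternative is to assign each phrase to the speaker it shares the most time with. This is a different rule from the one in the question, sketched here only for comparison; `overlap` and `assign_by_overlap` are hypothetical helper names, and the data is copied from the question. The words are kept as lists, per the design note above.

```python
list_1 = [
    {'word': 'hey hows it going?', 's1': 1.2, 's2': 3.6},
    {'word': 'um', 's1': 3.7, 's2': 4.2},
    {'word': 'its raining outside today', 's1': 4.3, 's2': 5.0},
    {'word': 'and its really cold', 's1': 5.1, 's2': 6.6},
    {'word': 'dont you think?', 's1': 6.7, 's2': 8.1},
    {'word': 'its awful', 's1': 7.7, 's2': 9.0},
]
list_2 = [
    {'category': 0, 's1': 0.0, 's2': 3.8},
    {'category': 1, 's1': 3.9, 's2': 4.9},
    {'category': 1, 's1': 5.0, 's2': 7.2},
    {'category': 0, 's1': 7.3, 's2': 7.6},
    {'category': 1, 's1': 7.7, 's2': 9.0},
]


def overlap(a, b):
    """Length of time shared by two {'s1', 's2'} intervals (0 if none)."""
    return max(0.0, min(a['s2'], b['s2']) - max(a['s1'], b['s1']))


def assign_by_overlap(phrases, speakers):
    """Return one list of words per speaker, in speaker order."""
    words = [[] for _ in speakers]
    for phrase in phrases:
        # Index of the speaker sharing the most time with this phrase;
        # max() keeps the first candidate when several are equally good.
        best = max(range(len(speakers)),
                   key=lambda i: overlap(phrase, speakers[i]))
        if overlap(phrase, speakers[best]) > 0:
            words[best].append(phrase['word'])
    return words


result = assign_by_overlap(list_1, list_2)
```

Under this rule 'dont you think?' (6.7-8.1) goes to the 5.0-7.2 range, because it shares more time with that interval than with 7.3-7.6 or 7.7-9.0, and the 7.3-7.6 range receives no words.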

Loop through multiple dictionaries to create a new dictionary from values in Python