Combine strings if they are part of another, larger string in the same list

Time: 2017-09-29 08:03:44

Tags: python algorithm

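Given a list of sentences, together with words and phrases that may be contained in those sentences, I want to drop the contained entries from the list and merge them into the largest string that contains them (if one exists). Every appearance of a "part" of that largest string should count toward the largest string's number of occurrences.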

from collections import defaultdict

sentence_parts = ['quick brown', 'brown fox', 'fox', 'lazy dog',
                  'quick brown fox jumps over the lazy dog',]

sentences_with_count = defaultdict(int)

for s in sentence_parts:
    matching_sentences = sorted([si for si in sentence_parts if s in si and len(si) > len(s)],
                                key=len, reverse=True)
    if matching_sentences:
        current_sent_count = sentences_with_count.get(s, 1)
        sentences_with_count[matching_sentences[0]] += current_sent_count
    else:
        sentences_with_count[s] += 1

print(sentences_with_count)

So the contents of sentences_with_count will be:

{
    'quick brown fox jumps over the lazy dog': 5
}

Here is the repl.it.

I know this is not efficient at all. How can I improve it?

More examples:

sentence_parts = ['The', 'Ohio State', 'Ohio', 
                  'Paris, France', 'Paris',
                  'The Ohio State University']

>>> {'The Ohio State University': 4, 'Paris, France': 2}

sentence_parts = ['Obama', 'Barack', 'Barack Hussein Obama']

>>> {'Barack Hussein Obama': 3}

sentence_parts = ['Obama', 'Barack', 'Barack Hussein Obama',
                  'Steve', 'Jobs', 'Steve Jobs', 'Mark', 'Bob']

>>> {'Barack Hussein Obama': 3, 'Steve Jobs': 3, 'Mark': 1, 'Bob': 1}

Another problem with this approach: if a substring matches several larger strings, only the count of the longest one is incremented:

sentence_parts = ['The', 'The New York City', 'The Voice']
>>> {'The New York City': 2, 'The Voice': 1}

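Ideally, the output should be {'The New York City': 2, 'The Voice': 2}

2 Answers:

Answer 0: (score: 0)

This is a bit shorter, and it fixes the problem described at the end of the question, where only the longest match is incremented.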

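sentence_parts = ['The', 'Ohio State', 'Ohio',
                  'Paris, France', 'Paris',
                  'The Ohio State University']
# Every entry starts with a count of 1 (itself); 'in' records whether
# the entry is contained in some other, larger entry.
matching = {key: {'count': 1, 'in': False} for key in sentence_parts}

for i in sentence_parts:
    for i2 in sentence_parts:
        if i in i2 and i != i2:
            matching[i2]['count'] += 1  # credit every larger string containing i
            matching[i]['in'] = True    # i is a part, so drop it from the result

# Keep only the entries that are not part of a larger entry.
print({x: matching[x]['count'] for x in matching if not matching[x]['in']})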
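To check this against the other examples in the question, the same nested loop can be wrapped in a function. This is only a sketch: the name `merge_counts` is made up for illustration, and it assumes the entries in the list are unique.

```python
def merge_counts(parts):
    """Return {maximal string: 1 + number of other entries contained in it}."""
    # Hypothetical helper; assumes the entries of `parts` are unique.
    count = {p: 1 for p in parts}
    contained = set()  # entries that occur inside some other entry
    for small in parts:
        for big in parts:
            if small != big and small in big:
                count[big] += 1       # every container is credited, not just the longest
                contained.add(small)
    return {p: c for p, c in count.items() if p not in contained}


print(merge_counts(['The', 'The New York City', 'The Voice']))
print(merge_counts(['Obama', 'Barack', 'Barack Hussein Obama',
                    'Steve', 'Jobs', 'Steve Jobs', 'Mark', 'Bob']))
```

Because each part is credited to every string that contains it, 'The' is counted for both 'The New York City' and 'The Voice'. The comparison is still quadratic in the number of entries.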

Edit: removed

sentence_parts = sorted(sentence_parts, key=len)

because it is not necessary.

Edit 2: shortened the dictionary creation by using a dict comprehension.

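Answer 1: (score: 0)

The following solution conceptually splits the problem into two operations:

  1. Find the actual number of occurrences of each sentence.
  2. Remove any sentence that has already been counted as part of a larger sentence.

Keeping the two steps separate makes this solution easier to debug and to extend in the future.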

    from collections import defaultdict
    
    sentence_parts =  ['The', 'Ohio State', 'Ohio',
                       'Paris, France', 'Paris',
                       'The Ohio State University']
    
    sentences_with_count = defaultdict(int)
    for part in sentence_parts:
        for sentence in sentence_parts:
            if part in sentence:
                sentences_with_count[sentence] += 1
    
    # sentences_with_count contains values for all parts.
    # Next step is to filter the ones counted in bigger terms
    
    sentence_keys = list(sentences_with_count.keys())
    for k in sentence_keys:
        for other in sentence_keys:
            if k in other and k != other:
                sentences_with_count.pop(k,None) # Remove consumed terms
                break
    
    print(sentences_with_count)
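The two steps can also be packaged as a function and tried on the 'The New York City' example from the question. This is a minimal sketch; `count_then_filter` is a hypothetical name, not part of the code above.

```python
from collections import defaultdict


def count_then_filter(parts):
    # Step 1: for every entry, count how many entries (itself included) occur inside it.
    counts = defaultdict(int)
    for part in parts:
        for sentence in parts:
            if part in sentence:
                counts[sentence] += 1
    # Step 2: keep only the entries that are not contained in a different entry.
    return {s: c for s, c in counts.items()
            if not any(s != other and s in other for other in counts)}


result = count_then_filter(['The', 'The New York City', 'The Voice'])
print(result)
```

Since the count in step 1 includes the sentence itself, a string that contains no other entry ends up with a count of 1.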