Check for a superset of lists in a given order

Time: 2015-10-12 09:34:12

Tags: python string list set

I have a list of tuples of the form (float, string), sorted in descending order of the float score.

print sent_scores
[(0.10507038451969995,'Deadly stampede in Shanghai - Emergency personnel help victims.'),
 (0.078586381821416265,'Deadly stampede in Shanghai - Police and medical staff help injured people after the stampede.'),
 (0.072031446647399661, '- Emergency personnel help victims.')]

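If two entries in the list share four identical words in a row (that is, in continuity), I want to remove the tuple with the lower score from the list. The new list should also preserve the original order.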

Expected output for the list above:

[(0.10507038451969995,'Deadly stampede in Shanghai - Emergency personnel help victims.')]

This will certainly involve tokenizing the words first, which can be done with the following code:

from nltk.tokenize import TreebankWordTokenizer

def tokenize_words(text):
    tokens = TreebankWordTokenizer().tokenize(text)
    # The tokenizer splits contractions off ("don't" -> "do", "n't"); glue them back on.
    contractions = ["n't", "'ll", "'m", "'s"]
    # Indices of the tokens that are contraction suffixes
    fix = [i for i, token in enumerate(tokens) if token in contractions]
    fix_offset = 0
    for fix_id in fix:
        # Each merge shortens the list by one, so shift later indices accordingly
        idx = fix_id - 1 - fix_offset
        tokens[idx] = tokens[idx] + tokens[idx+1]
        del tokens[idx+1]
        fix_offset += 1
    return tokens

tokenized_sents=[tokenize_words(sentence) for score,sentence in sent_scores]

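I previously tried turning every group of four words in each sentence into a set and then using issuperset against the other sentences. But that does not check for continuity.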
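To illustrate the problem, here is a minimal sketch in Python 3 syntax. It is one reading of the attempt described above, not the original code; the two sample sentences and all variable names are made up for illustration. A set forgets both order and adjacency, so issuperset can report a match even when the four words never appear next to each other.

```python
sentence_a = ['emergency', 'personnel', 'help', 'victims']
sentence_b = ['victims', 'thank', 'emergency', 'staff',
              'and', 'personnel', 'for', 'help']

# Set-based check: only asks whether all four words occur somewhere in sentence_b.
group = set(sentence_a[0:4])
set_match = set(sentence_b).issuperset(group)

# Contiguity-aware check: compare ordered 4-word windows instead.
window = tuple(sentence_a[0:4])
windows_b = [tuple(sentence_b[i:i + 4]) for i in range(len(sentence_b) - 3)]
contiguous_match = window in windows_b

print(set_match)         # True: the words are all present, but scattered
print(contiguous_match)  # False: they never appear as one run of four
```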

1 answer:

Answer 0 (score: 2)

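I suggest taking every sequence of 4 consecutive tokens from your tokenized lists and building a set of those sequences. With Python's itertools module this can be done quite elegantly: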

import itertools

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
i1 = itertools.islice(my_list, 0, None)
i2 = itertools.islice(my_list, 1, None)
i3 = itertools.islice(my_list, 2, None)
i4 = itertools.islice(my_list, 3, None)
print zip(i1, i2, i3, i4)

Output of the above code (nicely formatted for you):

[('The', 'quick', 'brown', 'fox'),
 ('quick', 'brown', 'fox', 'jumps'),
 ('brown', 'fox', 'jumps', 'over'),
 ('fox', 'jumps', 'over', 'the'),
 ('jumps', 'over', 'the', 'lazy'),
 ('over', 'the', 'lazy', 'dog')]

Actually, this is even more elegant:

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
iterators = [itertools.islice(my_list, x, None) for x in range(4)]
print zip(*iterators)

Same output as before.
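If you need this in more than one place, the same sliding-window idea can be wrapped in a small helper. This is only a sketch, written in Python 3 syntax; the function name ngrams and its parameter n are illustrative names of my own, not something the snippets above define.

```python
import itertools

def ngrams(tokens, n=4):
    """Return all runs of n consecutive tokens as a list of tuples."""
    # One iterator per starting offset; zip stops as soon as the shortest runs out.
    shifted = [itertools.islice(tokens, offset, None) for offset in range(n)]
    return list(zip(*shifted))

words = ['The', 'quick', 'brown', 'fox', 'jumps']
print(ngrams(words))
# [('The', 'quick', 'brown', 'fox'), ('quick', 'brown', 'fox', 'jumps')]
print(ngrams(['too', 'short']))
# []
```

A list with fewer than n tokens simply produces no tuples, because the last shifted iterator is empty from the start.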

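Now that you have, for each token list, a list of all runs of four consecutive tokens (4-tuples), you can put those tuples into a set and check whether the same 4-tuple appears in two different sets: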

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
set1 = set(zip(*[itertools.islice(my_list, x, None) for x in range(4)]))

other_list = ['The', 'quick', 'red', 'fox', 'goes', 'home']
set2 = set(zip(*[itertools.islice(other_list, x, None) for x in range(4)]))

print set1.intersection(set2) # Empty set
if set1.intersection(set2):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Nothing in common"

third_list = ['The', 'quick', 'brown', 'fox', 'goes', 'to', 'school']
set3 = set(zip(*[itertools.islice(third_list, x, None) for x in range(4)]))

print set1.intersection(set3) # Set containing ('The', 'quick', 'brown', 'fox')
if set1.intersection(set3):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Found something in common"

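Note: if you are using Python 3, replace every print "Something" statement with print("Something"): in Python 3, print became a function rather than a statement. Also, zip returns an iterator in Python 3, so wrap it in list() before printing it. But if you're using NLTK, I suspect you're on Python 2.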

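Important: each itertools.islice object you create will iterate over its original list once and then become "exhausted" (it has already returned all of its data, so putting it in a second for loop has no effect; the loop simply does nothing). If you want to go over the same list several times, create several iterators (as I did in my examples).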
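A quick way to see the exhaustion for yourself is the following sketch (Python 3 syntax; the variable names are illustrative only):

```python
import itertools

my_list = ['The', 'quick', 'brown', 'fox']
it = itertools.islice(my_list, 1, None)

first_pass = list(it)   # consumes the iterator
second_pass = list(it)  # nothing left to return

print(first_pass)   # ['quick', 'brown', 'fox']
print(second_pass)  # []

# Creating a new islice object starts over from the underlying list.
fresh = list(itertools.islice(my_list, 1, None))
print(fresh)        # ['quick', 'brown', 'fox']
```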

Update: here is how to eliminate the lower-scoring sentences. First, replace this line:

tokenized_sents=[tokenize_words(sentence) for score,sentence in sent_scores]

with:

tokenized_sents=[(score,tokenize_words(sentence)) for score,sentence in sent_scores]

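What you have now is a list of (score, sentence) tuples, where each sentence is a list of tokens. Next we'll build a list called scores_and_sentences_and_sets, which will be a list of (score, sentence, set_of_four_word_groups) tuples (where set_of_four_word_groups is a set of four-word slices like in the examples above):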

scores_and_sentences_and_sets = [(score, sentence, set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))) for score,sentence in tokenized_sents]

Actually, that one-liner may be a bit too clever, so let's unpack it to make it more readable:

scores_and_sentences_and_sets = []
for score, sentence in tokenized_sents:
    set_of_four_word_groups = set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))
    score_sentence_and_sets_tuple = (score, sentence, set_of_four_word_groups)
    scores_and_sentences_and_sets.append(score_sentence_and_sets_tuple)

Go ahead and try both code snippets, and you'll find that they do exactly the same thing.

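OK, so now we have a list of (score, sentence, set_of_four_word_groups) tuples. We'll go through the list in order and build a result list containing only the sentences we want to keep. Because the list is already sorted in descending order, this is easier: at any point in the list, we only need to look at the items that have already been "accepted" to see whether any of them is a duplicate. If an accepted item duplicates the one we're looking at, we don't even need to compare scores, because the accepted item came earlier in the list than the one we're looking at, so its score must be at least as high.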

So here is some code that should do what you want:

accepted_items = []
for current_tuple in scores_and_sentences_and_sets:
    score, sentence, set_of_four_words = current_tuple
    found = False
    for accepted_tuple in accepted_items:
        accepted_score, accepted_sentence, accepted_set = accepted_tuple
        if set_of_four_words.intersection(accepted_set):
            found = True
            break
    if not found:
        accepted_items.append(current_tuple)
print accepted_items # Prints a whole bunch of tuples
sentences_only = [sentence for score, sentence, word_set in accepted_items]
print sentences_only # Prints just the sentences
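Putting the pieces together, here is a self-contained sketch of the whole filter in Python 3 syntax. It makes some assumptions that are not in the code above: it tokenizes with plain str.split instead of the NLTK tokenizer so that it runs without extra dependencies, and the names four_word_groups and drop_overlapping are made up for this example. Pass your own tokenize_words as the tokenize argument to get the original tokenization.

```python
import itertools

def four_word_groups(tokens, n=4):
    """Set of all runs of n consecutive tokens."""
    return set(zip(*[itertools.islice(tokens, i, None) for i in range(n)]))

def drop_overlapping(sent_scores, tokenize=str.split):
    """Keep a (score, sentence) tuple unless it shares a run of four tokens
    with an earlier kept sentence. Assumes sent_scores is sorted by score,
    descending, so earlier entries never score lower."""
    accepted = []       # (score, sentence) tuples we keep, in order
    accepted_sets = []  # the matching sets of four-token runs
    for score, sentence in sent_scores:
        groups = four_word_groups(tokenize(sentence))
        if any(groups & seen for seen in accepted_sets):
            continue    # overlaps a higher-ranked sentence: drop it
        accepted.append((score, sentence))
        accepted_sets.append(groups)
    return accepted

sent_scores = [
    (0.10507038451969995, 'Deadly stampede in Shanghai - Emergency personnel help victims.'),
    (0.078586381821416265, 'Deadly stampede in Shanghai - Police and medical staff help injured people after the stampede.'),
    (0.072031446647399661, '- Emergency personnel help victims.'),
]
print(drop_overlapping(sent_scores))
# [(0.10507038451969995, 'Deadly stampede in Shanghai - Emergency personnel help victims.')]
```

One thing to be aware of: a sentence with fewer than four tokens yields an empty set, so it never overlaps with anything and is always kept.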