I have a list of tuples of the form (float, string), sorted in descending order.
print sent_scores
[(0.10507038451969995,'Deadly stampede in Shanghai - Emergency personnel help victims.'),
(0.078586381821416265,'Deadly stampede in Shanghai - Police and medical staff help injured people after the stampede.'),
(0.072031446647399661, '- Emergency personnel help victims.')]
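If two entries in the list share four consecutive words, I want to remove the tuple with the lower score. The new list should also preserve the order.
Desired output for the list above: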
[(0.10507038451969995,'Deadly stampede in Shanghai - Emergency personnel help victims.')]
This will certainly involve tokenizing the words first, which can be done with the following code:
from nltk.tokenize import TreebankWordTokenizer

def tokenize_words(text):
    tokens = TreebankWordTokenizer().tokenize(text)
    contractions = ["n't", "'ll", "'m", "'s"]
    # Record the positions of contraction tokens
    fix = []
    for i in range(len(tokens)):
        for c in contractions:
            if tokens[i] == c:
                fix.append(i)
    # Merge each contraction back into the token before it
    fix_offset = 0
    for fix_id in fix:
        idx = fix_id - 1 - fix_offset
        tokens[idx] = tokens[idx] + tokens[idx+1]
        del tokens[idx+1]
        fix_offset += 1
    return tokens

tokenized_sents = [tokenize_words(sentence) for score, sentence in sent_scores]
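Earlier I tried converting the words of each sentence, in groups of four, into a set and then using issuperset against the other sentences. But that does not check for contiguity.
Answer 0 (score: 2)
I suggest taking sequences of 4 consecutive tokens from your tokenized lists and making a set of them. With Python's itertools module this can be done fairly elegantly: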
import itertools

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
# Four iterators over the same list, starting at offsets 0, 1, 2 and 3
i1 = itertools.islice(my_list, 0, None)
i2 = itertools.islice(my_list, 1, None)
i3 = itertools.islice(my_list, 2, None)
i4 = itertools.islice(my_list, 3, None)
print zip(i1, i2, i3, i4)
Output of the code above (nicely formatted for you):
[('The', 'quick', 'brown', 'fox'),
('quick', 'brown', 'fox', 'jumps'),
('brown', 'fox', 'jumps', 'over'),
('fox', 'jumps', 'over', 'the'),
('jumps', 'over', 'the', 'lazy'),
('over', 'the', 'lazy', 'dog')]
Actually, even more elegant would be:
my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
iterators = [itertools.islice(my_list, x, None) for x in range(4)]
print zip(*iterators)
Same output as before.
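The same idea generalizes to any window size. Here is a minimal sketch, assuming a hypothetical helper named ngrams (it is not part of the code above), written so that it runs on both Python 2 and Python 3:

```python
import itertools

def ngrams(tokens, n=4):
    """Return every run of n consecutive tokens as a list of tuples.

    Hypothetical helper, for illustration only.
    """
    # One iterator per starting offset 0..n-1; zip stops as soon as the
    # iterator with the largest offset runs out of tokens.
    iterators = [itertools.islice(tokens, start, None) for start in range(n)]
    return list(zip(*iterators))

words = ['The', 'quick', 'brown', 'fox', 'jumps']
print(ngrams(words))             # two 4-tuples
print(ngrams(['too', 'short']))  # [] because there are fewer than 4 tokens
```

Note that a sentence with fewer than four tokens produces no tuples at all, so it can never be reported as overlapping with another sentence.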
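Now that you have a list of four consecutive tokens (4-tuples) for each token list, you can put those tuples into a set and check whether the same 4-tuple appears in two different sets: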
my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
set1 = set(zip(*[itertools.islice(my_list, x, None) for x in range(4)]))

other_list = ['The', 'quick', 'red', 'fox', 'goes', 'home']
set2 = set(zip(*[itertools.islice(other_list, x, None) for x in range(4)]))
print set1.intersection(set2)  # Empty set
if set1.intersection(set2):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Nothing in common"

third_list = ['The', 'quick', 'brown', 'fox', 'goes', 'to', 'school']
set3 = set(zip(*[itertools.islice(third_list, x, None) for x in range(4)]))
print set1.intersection(set3)  # Set containing ('The', 'quick', 'brown', 'fox')
if set1.intersection(set3):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Found something in common"
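If you only need a yes/no answer, the check can be wrapped in a small function. This is a sketch under my own naming (four_grams and share_four_in_a_row are hypothetical helpers), using set.isdisjoint so that no intersection set has to be built. The second call shows why this respects contiguity: the two lists contain exactly the same words, but no four of them appear in the same consecutive order.

```python
import itertools

def four_grams(tokens):
    """Set of every run of 4 consecutive tokens (hypothetical helper)."""
    return set(zip(*[itertools.islice(tokens, x, None) for x in range(4)]))

def share_four_in_a_row(tokens_a, tokens_b):
    """True if the two token lists have any 4 consecutive tokens in common."""
    return not four_grams(tokens_a).isdisjoint(four_grams(tokens_b))

# Shares the run ('a', 'b', 'c', 'd')
print(share_four_in_a_row(['a', 'b', 'c', 'd', 'e'], ['x', 'a', 'b', 'c', 'd']))  # True
# Same words, different order: no common run of four
print(share_four_in_a_row(['a', 'b', 'c', 'd', 'e'], ['a', 'c', 'b', 'd', 'e']))  # False
```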
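Note: if you are using Python 3, replace every print "Something" statement with print("Something"); in Python 3, print became a function rather than a statement. In Python 3, zip also returns an iterator rather than a list, so wrap it in list() before printing it. But if you are using NLTK, I suspect you are on Python 2.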
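Important: every itertools.islice object you create will iterate through its original list once and then become "exhausted" (it has already returned all of its data, so putting it in a second for loop produces nothing; the for loop simply won't do anything). If you want to iterate over the same list more than once, create multiple iterators (as I did in my examples).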
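To make the exhaustion behaviour concrete, here is a short illustration (the variable names are mine, and it is written to run on both Python 2 and Python 3):

```python
import itertools

my_list = ['The', 'quick', 'brown', 'fox']
it = itertools.islice(my_list, 1, None)

first_pass = list(it)   # consumes the iterator
second_pass = list(it)  # the iterator is exhausted, so nothing is left

print(first_pass)   # ['quick', 'brown', 'fox']
print(second_pass)  # []

# The underlying list is untouched; a fresh islice starts over.
fresh = itertools.islice(my_list, 1, None)
print(list(fresh))  # ['quick', 'brown', 'fox']
```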
Update: here is how to eliminate the lower-scoring sentences. First, replace this line:
tokenized_sents=[tokenize_words(sentence) for score,sentence in sent_scores]
with:
tokenized_sents=[(score,tokenize_words(sentence)) for score,sentence in sent_scores]
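What you have now is a list of (score, sentence) tuples, where each sentence is a list of tokens. Next we build a list called scores_and_sentences_and_sets, which will be a list of (score, sentence, set_of_four_word_groups) tuples (where set_of_four_word_groups is a set of four-word slices like the ones in the example above):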
scores_and_sentences_and_sets = [(score, sentence, set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))) for score,sentence in tokenized_sents]
Actually, that one-liner is perhaps a bit too clever, so let's unpack it to make it more readable:
scores_and_sentences_and_sets = []
for score, sentence in tokenized_sents:
    set_of_four_word_groups = set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))
    score_sentence_and_sets_tuple = (score, sentence, set_of_four_word_groups)
    scores_and_sentences_and_sets.append(score_sentence_and_sets_tuple)
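Go ahead and try both code snippets, and you'll find that they do exactly the same thing.
OK, so now we have a list of (score, sentence, set_of_four_word_groups) tuples. We'll go through the list in order and build up a result list containing only the sentences we want to keep. Since the list is already sorted in descending order, this is easier: at any point in the list, we only need to look at the items that have already been "accepted" to see whether any of them overlaps with the current one. If an accepted item overlaps with the item we're looking at, we don't even need to compare scores, because the accepted item came earlier in the list and therefore must have a score at least as high as the current one.
So here's some code that should do what you want: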
accepted_items = []
for current_tuple in scores_and_sentences_and_sets:
    score, sentence, set_of_four_words = current_tuple
    found = False
    # Compare only against items that have already been accepted
    for accepted_tuple in accepted_items:
        accepted_score, accepted_sentence, accepted_set = accepted_tuple
        if set_of_four_words.intersection(accepted_set):
            found = True
            break
    if not found:
        accepted_items.append(current_tuple)
print accepted_items  # Prints a whole bunch of tuples
sentences_only = [sentence for score, sentence, word_set in accepted_items]
print sentences_only  # Prints just the sentences (as lists of tokens)