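I have a list containing many tagged bigrams. Some of the bigrams are not tagged correctly, so I want to remove them from the master list. One of the words in a bad bigram is often repeated, so I can remove a bigram if it contains a given word. A pseudo-example follows: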
master_list = ['this is', 'is a', 'a sample', 'sample word', 'sample text', 'this book', 'a car', 'literary text', 'new book', 'them about', 'on the' , 'in that', 'tagged corpus', 'on top', 'a car', 'an orange', 'the book', 'them what', 'then how']
unwanted_words = ['this', 'is', 'a', 'on', 'in', 'an', 'the', 'them']
new_list = [item for item in master_list if not [x for x in unwanted_words] in item]
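I can remove these items one word at a time, i.e. create a new list each time and drop the items containing that word, say 'on'. This is tedious: it takes hours of filtering and creating a new list for every unwanted word. I thought a loop would help. However, I get the following TypeError: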
Traceback (most recent call last):
File "<pyshell#21>", line 1, in <module>
new_list = [item for item in master_list if not [x for x in unwanted_words] in item]
File "<pyshell#21>", line 1, in <listcomp>
new_list = [item for item in master_list if not [x for x in unwanted_words] in item]
TypeError: 'in <string>' requires string as left operand, not list
Any help is greatly appreciated!
Answer 0 (score: 1)
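Your condition
if not [x for x in unwanted_words] in item
is the same as
if not unwanted_words in item
i.e. you are checking whether the list is contained in the string, which is what raises the TypeError.
Instead, you can use any() to check whether any word of the bigram is in unwanted_words. You can also make unwanted_words a set to speed up the lookups.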
>>> master_list = ['this is', 'is a', 'a sample', 'sample word', 'sample text', 'this book', 'a car', 'literary text', 'new book', 'them about', 'on the' , 'in that', 'tagged corpus', 'on top', 'a car', 'an orange', 'the book', 'them what', 'then how']
>>> unwanted_words = set(['this', 'is', 'a', 'on', 'in', 'an', 'the', 'them'])
>>> [item for item in master_list if not any(x in unwanted_words for x in item.split())]
['sample word', 'sample text', 'literary text', 'new book', 'tagged corpus', 'then how']
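If you need this in more than one place, the same idea can be wrapped in a small function. This is only a sketch: the name drop_unwanted_bigrams and the shortened sample list are made up here for illustration and are not part of the original code.

```python
def drop_unwanted_bigrams(bigrams, unwanted_words):
    """Return the bigrams that contain none of the unwanted words.

    Hypothetical helper, assuming each bigram is a whitespace-separated string.
    """
    unwanted = set(unwanted_words)  # set membership tests are O(1) on average
    return [
        bigram for bigram in bigrams
        # split() compares whole words, so 'the' does not match inside 'then'
        if not any(word in unwanted for word in bigram.split())
    ]


sample = ['this is', 'then how', 'on top', 'tagged corpus']
unwanted_words = ['this', 'is', 'a', 'on', 'in', 'an', 'the', 'them']

print(drop_unwanted_bigrams(sample, unwanted_words))
# ['then how', 'tagged corpus']
```

Splitting each bigram into words matters here: a plain substring test such as 'the' in 'then how' is True, so it would wrongly drop 'then how'.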