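Removing words from a list of text strings

Asked: 2015-10-20 15:25:03

Tags: python text stop-words

I am trying to remove certain words from a list of text strings (in addition to using stop words), but for some reason it is not working: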

documents = ["Human machine interface for lab abc computer applications",
         "A survey of user opinion of computer system response time",
         "The EPS user interface management system",
         "System and human system engineering testing of EPS",
         "Relation of user perceived response time to error measurement",
         "The generation of random binary unordered trees",
         "The intersection graph of paths in trees",
         "Graph minors IV Widths of trees and well quasi ordering",
         "Graph minors A survey"]

exclude = ['am', 'there','here', 'for', 'of', 'user']

new_doc = [word for word in documents if word not in exclude]

print new_doc

Output:

['Human machine interface for lab abc computer applications', 'A survey of user opinion of computer system response time', 'The EPS user interface management system', 'System and human system engineering testing of EPS', 'Relation of user perceived response time to error measurement', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors IV Widths of trees and well quasi ordering', 'Graph minors A survey']

As you can see, none of the words in exclude are removed from the documents ("for" is a prime example).

It does work with this expression:

new_doc = [word for word in str(documents).split() if word not in exclude]

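But how do I get back the initial elements of documents (albeit "cleaned")?

I would greatly appreciate your help!

2 answers:

Answer 0 (score: 3)

You should split each line into words before filtering: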

new_doc = [' '.join([word for word in line.split() if word not in exclude]) for line in documents]
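
The original comprehension fails because iterating over documents yields whole sentences, and no whole sentence is an element of exclude. Below is a minimal sketch of the line-by-line fix wrapped in a function. The name remove_words and the conversion of exclude to a set are illustrative choices of mine, not part of the question.

```python
def remove_words(lines, words_to_drop):
    """Return the lines with every whitespace-separated token that
    appears in words_to_drop removed (hypothetical helper name)."""
    drop = set(words_to_drop)  # set membership tests are O(1) on average
    return [' '.join(tok for tok in line.split() if tok not in drop)
            for line in lines]

documents = ["The EPS user interface management system",
             "Graph minors A survey"]
exclude = ['am', 'there', 'here', 'for', 'of', 'user']

# The original attempt: each element is a full sentence, so nothing matches.
unchanged = [s for s in documents if s not in exclude]

cleaned = remove_words(documents, exclude)
print(cleaned)
```

Note that the comparison is case-sensitive, so a capitalised "Of" or "User" would be kept unless the tokens are lowercased before the membership test.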

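Answer 1 (score: 1)

You are looping over sentences, not words. To get the result you want, split each sentence, use a nested loop to iterate over its words and filter them, then join what remains.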

>>> new_doc = [' '.join([word for word in sent.split() if word not in exclude]) for sent in documents]
>>> 
>>> new_doc
['Human machine interface lab abc computer applications', 'A survey opinion computer system response time', 'The EPS interface management system', 'System and human system engineering testing EPS', 'Relation perceived response time to error measurement', 'The generation random binary unordered trees', 'The intersection graph paths in trees', 'Graph minors IV Widths trees and well quasi ordering', 'Graph minors A survey']
>>> 
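
If the nested comprehension is hard to read, the same logic can be spelled out with explicit loops. This is only an equivalent sketch on two of the sample sentences; the variable name kept is my own.

```python
documents = ["A survey of user opinion of computer system response time",
             "The intersection graph of paths in trees"]
exclude = ['am', 'there', 'here', 'for', 'of', 'user']

new_doc = []
for sent in documents:              # outer loop: one sentence at a time
    kept = []
    for word in sent.split():       # inner loop: the words of that sentence
        if word not in exclude:     # keep only words that are not excluded
            kept.append(word)
    new_doc.append(' '.join(kept))  # rebuild the cleaned sentence

print(new_doc)
```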

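Alternatively, instead of the nested list comprehension with splitting and filtering, you can use a regex and replace the exclude words with an empty string via the re.sub function: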

>>> import re
>>> 
>>> new_doc = [re.sub(r'|'.join(exclude),'',sent) for sent in documents]
>>> new_doc
['Human machine interface  lab abc computer applications', 'A survey   opinion  computer system response time', 'The EPS  interface management system', 'System and human system engineering testing  EPS', 'Relation   perceived response time to error measurement', 'The generation  random binary unordered trees', 'The intersection graph  paths in trees', 'Graph minors IV Widths  trees and well quasi ordering', 'Graph minors A survey']
>>> 

r'|'.join(exclude) joins the words with a pipe (which means logical OR in a regular expression).
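
Two caveats about the plain alternation: it leaves doubled spaces behind (visible in the output above), and it matches inside longer words. A hedged sketch of one way around both, using \b word boundaries, re.escape, and a whitespace cleanup step, is shown below. The helper name strip_words is hypothetical.

```python
import re

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time"]
exclude = ['am', 'there', 'here', 'for', 'of', 'user']

# \b anchors each alternative at word boundaries; re.escape protects
# against regex metacharacters that might appear in the words.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, exclude)) + r')\b')

def strip_words(sent):
    without = pattern.sub('', sent)
    return ' '.join(without.split())  # collapse leftover runs of spaces

new_doc = [strip_words(sent) for sent in documents]
print(new_doc)

# Without \b, the bare alternation also removes 'for' from inside a word:
partial = re.sub('|'.join(exclude), '', 'information')
print(partial)
```

Like the split-based version, this match is case-sensitive; passing flags=re.IGNORECASE to re.compile would change that.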