I am trying to remove certain words (in addition to the usual stopwords) from a list of text strings, but for some reason it is not working:
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
exclude = ['am', 'there','here', 'for', 'of', 'user']
new_doc = [word for word in documents if word not in exclude]
print(new_doc)
Output:
['Human machine interface for lab abc computer applications', 'A survey of user opinion of computer system response time', 'The EPS user interface management system', 'System and human system engineering testing of EPS', 'Relation of user perceived response time to error measurement', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors IV Widths of trees and well quasi ordering', 'Graph minors A survey']
As you can see, none of the words in exclude were removed from documents ("for" is a prime example).
It does work with this statement:
new_doc = [word for word in str(documents).split() if word not in exclude]
But then how do I get back the original elements of documents (albeit "cleaned")?
I would greatly appreciate your help!
Answer 0 (score: 3)
You should split each line into words before filtering:
new_doc = [' '.join([word for word in line.split() if word not in exclude]) for line in documents]
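For reference, the same idea can be wrapped in a small helper. This is only a sketch: the name remove_words is hypothetical (it is not from the question), and converting the exclusion list to a set is an optional tweak that mainly matters when the list is long.

```python
def remove_words(lines, excluded):
    """Return a copy of lines with every excluded word dropped."""
    excluded = set(excluded)  # set membership tests are O(1) on average
    return [
        " ".join(word for word in line.split() if word not in excluded)
        for line in lines
    ]


documents = [
    "Human machine interface for lab abc computer applications",
    "The EPS user interface management system",
]
exclude = ["am", "there", "here", "for", "of", "user"]

new_doc = remove_words(documents, exclude)
print(new_doc[0])  # Human machine interface lab abc computer applications
```

Note that the comparison is case-sensitive, so a capitalized "For" or "User" would survive unless both sides are lowercased first.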
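Answer 1 (score: 1)
You are looping over sentences, not words. To get the result you want, split each sentence, use a nested loop to go over its words and filter them, then join what is left: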
>>> new_doc = [' '.join([word for word in sent.split() if word not in exclude]) for sent in documents]
>>>
>>> new_doc
['Human machine interface lab abc computer applications', 'A survey opinion computer system response time', 'The EPS interface management system', 'System and human system engineering testing EPS', 'Relation perceived response time to error measurement', 'The generation random binary unordered trees', 'The intersection graph paths in trees', 'Graph minors IV Widths trees and well quasi ordering', 'Graph minors A survey']
>>>
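Spelled out with explicit loops, the same logic might look like the sketch below. It uses a two-sentence subset of the question's data, and the variable names unchanged and kept are illustrative only.

```python
documents = [
    "A survey of user opinion of computer system response time",
    "Graph minors A survey",
]
exclude = ["am", "there", "here", "for", "of", "user"]

# Looping over the list yields whole sentences, and no sentence equals
# a single excluded word, so the original filter removes nothing.
unchanged = [item for item in documents if item not in exclude]

new_doc = []
for sent in documents:
    kept = []
    for word in sent.split():        # split() breaks on whitespace
        if word not in exclude:
            kept.append(word)
    new_doc.append(" ".join(kept))   # rebuild one string per document

print(unchanged == documents)  # True
print(new_doc)
```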
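Alternatively, instead of the nested list comprehension with split and filter, you can use a regex and replace the exclude words with an empty string using the re.sub function: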
>>> import re
>>>
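>>> # \b restricts matches to whole words; \s* also removes the space after each removed word
>>> new_doc = [re.sub(r'\b(?:' + r'|'.join(exclude) + r')\b\s*', '', sent) for sent in documents]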
>>> new_doc
['Human machine interface lab abc computer applications', 'A survey opinion computer system response time', 'The EPS interface management system', 'System and human system engineering testing EPS', 'Relation perceived response time to error measurement', 'The generation random binary unordered trees', 'The intersection graph paths in trees', 'Graph minors IV Widths trees and well quasi ordering', 'Graph minors A survey']
>>>
r'|'.join(exclude) joins the words with a pipe character (|), which means logical OR in a regular expression.
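One thing to watch with a regex approach: an alternation without word boundaries also matches inside longer words, and removing a word leaves its surrounding spaces behind. The sketch below is one way to guard against both. The helper name strip_words and the sample sentences are made up for illustration; re.escape, re.compile and \b are standard features of the re module.

```python
import re

exclude = ["am", "there", "here", "for", "of", "user"]

# re.escape guards against regex metacharacters in the word list;
# \b...\b restricts matches to whole words.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, exclude)) + r")\b")


def strip_words(text):
    without = pattern.sub("", text)
    # Removing a word leaves its neighbouring spaces behind, so collapse them.
    return " ".join(without.split())


# A bare alternation also eats "for" in "performance" and "am" in "program".
naive = re.sub("|".join(exclude), "", "performance of the program")

print(repr(naive))                                # 'permance  the progr'
print(strip_words("performance of the program"))  # performance the program
print(strip_words("A survey of user opinion"))    # A survey opinion
```

Compiling the pattern once is a small convenience when the same word list is applied to many documents.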