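I would like to know how to remove stop words from a list of the most common words. I only want to keep the remaining words. An example of the structure is shown below: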
sentence = [('the', 2112), ('and', 1914), ('to', 1505), ('of', 1086), ('a', 986), ('you', 912),
('in', 754), ('with', 549), ('is', 536), ('for', 473), ('it', 461), ('book', 427),
('how', 368), ('that', 347), ('as', 304), ('on', 301), ('this', 290), ('java', 289),
('s', 267), ('your', 263), ('applications', 248), ('web', 231), ('can', 219),
('new', 218), ('an', 206), ('are', 197), ('will', 187), ('from', 185), ('use', 185), ('ll', 183),
('development', 182), ('code', 180), ('by', 177), ('programming', 172), ('application', 170), ('or', 169),
('action', 163), ('developers', 150), ('features', 141), ('examples', 139), ('learn', 135), ('using', 132),
('be', 132), ('data', 131), ('more', 118), ('like', 115), ('build', 110), ('into', 109), ('net', 106), ('language', 105)]
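Any help is appreciated.
Answer 0 (score: 2)
You should first create a set of stop words; you can then filter them out with a list comprehension like this: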
>>> stopList = {'the','and','to','in'}
>>> [(word, count) for word, count in sentence if word not in stopList]
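For reference, the same filter can be wrapped in a small self-contained script. This is only a minimal sketch: the sample data is a shortened copy of the list from the question, and the function name remove_stop_words is made up for illustration.

```python
def remove_stop_words(pairs, stop_words):
    """Return the (word, count) pairs whose word is not a stop word."""
    # Build the set once so that each membership test is O(1) on average.
    stop_set = set(stop_words)
    return [(word, count) for word, count in pairs if word not in stop_set]


# Shortened copy of the list from the question.
sentence = [('the', 2112), ('and', 1914), ('to', 1505), ('of', 1086),
            ('in', 754), ('book', 427), ('java', 289)]
stop_list = {'the', 'and', 'to', 'in'}

filtered = remove_stop_words(sentence, stop_list)
print(filtered)
# [('of', 1086), ('book', 427), ('java', 289)]
```

Note that 'of' survives here because it is not in this four-word stop list; the filter removes only the words you put in the set.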
Answer 1 (score: 2)
If you want a complete set of stop words, you can use the list available in nltk, as follows:
from nltk.corpus import stopwords
# Convert the list to a set so each membership test is O(1) on average.
stop_words = set(stopwords.words('english'))
sentence = [('the', 2112), ('and', 1914), ('to', 1505), ('of', 1086), ('a', 986), ('you', 912),
('in', 754), ('with', 549), ('is', 536), ('for', 473), ('it', 461), ('book', 427),
('how', 368), ('that', 347), ('as', 304), ('on', 301), ('this', 290), ('java', 289),
('s', 267), ('your', 263), ('applications', 248), ('web', 231), ('can', 219),
('new', 218), ('an', 206), ('are', 197), ('will', 187), ('from', 185), ('use', 185), ('ll', 183),
('development', 182), ('code', 180), ('by', 177), ('programming', 172), ('application', 170), ('or', 169),
('action', 163), ('developers', 150), ('features', 141), ('examples', 139), ('learn', 135), ('using', 132),
('be', 132), ('data', 131), ('more', 118), ('like', 115), ('build', 110), ('into', 109), ('net', 106), ('language', 105)]
sentence = [(word, count) for word, count in sentence if word not in stop_words]
print(sentence)
This gives you the following value for sentence:
[('book', 427), ('java', 289), ('applications', 248), ('web', 231), ('new', 218), ('use', 185), ('development', 182), ('code', 180), ('programming', 172), ('application', 170), ('action', 163), ('developers', 150), ('features', 141), ('examples', 139), ('learn', 135), ('using', 132), ('data', 131), ('like', 115), ('build', 110), ('net', 106), ('language', 105)]
You can get the library with pip install nltk. You may then need to download the stop word data first, as follows:
import nltk
nltk.download()
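This opens a download utility from which you can fetch the stopwords corpus.
Answer 2 (score: 0)
A set gives O(1) average-case lookups, and out_tup will hold the desired output: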
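If you would rather avoid the interactive window, one possible approach is sketched below. It assumes the corpus has already been fetched once with nltk.download('stopwords'). The names load_stop_words and FALLBACK_STOP_WORDS are hypothetical helpers made up for this example, and the fallback set is deliberately tiny and far from complete.

```python
# Hypothetical fallback, used only when NLTK or its stopwords corpus is missing.
FALLBACK_STOP_WORDS = {'the', 'and', 'to', 'of', 'a', 'in', 'is', 'for', 'it'}


def load_stop_words(language='english'):
    """Return NLTK's stop words as a set, or a small fallback set."""
    try:
        # Requires a one-time nltk.download('stopwords') beforehand.
        from nltk.corpus import stopwords
        return set(stopwords.words(language))
    except (ImportError, LookupError):
        # ImportError: nltk is not installed.
        # LookupError: the stopwords corpus has not been downloaded.
        return set(FALLBACK_STOP_WORDS)


stop_words = load_stop_words()
pairs = [('the', 2112), ('book', 427), ('java', 289)]
result = [(word, count) for word, count in pairs if word not in stop_words]
print(result)
```

In real use you would probably want the LookupError to surface instead of silently falling back, so that a missing corpus is noticed.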
in_tup = [('the', 2112), ('and', 1914), ('to', 1505)]
stop_list = {"to","the"}
out_tup = [i for i in in_tup if i[0] not in stop_list]
print(out_tup)