a = []
for i in range(0, len(content)):
    norm = re.sub('[^a-zA-Z]', ' ', content['text'][i]).lower()
    norm = norm.split()
    norm = [PorterStemmer().stem(word) for word in norm if not word in
            stopwords.words('english')]
    norm = ' '.join(norm)
    a.append(norm)
Answer 0 (score: 0)
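Create a set from stopwords.words('english'). The original code calls stopwords.words('english') again for every word and then scans the resulting list; a set built once turns each membership test into a hash lookup. Use: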
excluded_words = set(stopwords.words('english'))
Also, don't instantiate the same object over and over:
stemmer = PorterStemmer()
Then just use:
norm = [stemmer.stem(word) for word in norm if word not in excluded_words]
Finally, stop looping like this: for i in range(0, len(content)): and iterate over the content directly. Putting it all together:
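import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

a = []
stemmer = PorterStemmer()  # create the stemmer once and reuse it
excluded_words = set(stopwords.words('english'))  # set membership tests are O(1) on average
for text in content["text"]:
    norm = re.sub('[^a-zA-Z]', ' ', text).lower()
    norm = norm.split()
    norm = [stemmer.stem(word) for word in norm if word not in excluded_words]
    a.append(' '.join(norm))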
To be clear, the most impactful change here is using a set. The rest are only minor optimizations.
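
If you want to see the list-versus-set difference for yourself, here is a minimal sketch that needs only the standard library. STOP_LIST is a small hypothetical stand-in for stopwords.words('english') (the real NLTK list is longer), and tokenize and filter_with are illustrative helper names, not part of NLTK. Stemming is left out so that only the lookup cost is measured.

```python
import re
import timeit

# Hypothetical stand-in for stopwords.words('english'); the real list is longer.
STOP_LIST = ["i", "me", "my", "we", "our", "you", "he", "she", "it", "they",
             "the", "a", "an", "and", "but", "or", "of", "at", "by", "for",
             "with", "to", "from", "in", "on", "is", "are", "was", "were", "not"]
STOP_SET = set(STOP_LIST)


def tokenize(text):
    """Lowercase the text and split on anything that is not an ASCII letter."""
    return re.sub('[^a-zA-Z]', ' ', text).lower().split()


def filter_with(container, words):
    """Drop every word found in `container` (a list or a set)."""
    return [w for w in words if w not in container]


sentence = "The quick brown fox was not in the garden, but it jumped over the fence!"
words = tokenize(sentence) * 1000

# Both containers give the same result; only the lookup cost differs.
assert filter_with(STOP_LIST, words) == filter_with(STOP_SET, words)

list_time = timeit.timeit(lambda: filter_with(STOP_LIST, words), number=20)
set_time = timeit.timeit(lambda: filter_with(STOP_SET, words), number=20)
print(f"list: {list_time:.3f}s  set: {set_time:.3f}s")
```

The exact timings depend on your machine, and the gap should widen with a longer stop word list, since a list lookup scans the entries one by one while a set lookup does not.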