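I am trying to clean a set of news articles (the input is news articles), and my output is the 10 most common words across the articles.
When using the NLTK stop words, some words still get through: ['the','would','said','one','also','like','could','he'], so I added them as stop words myself. I tried both append and extend, as shown in the code below. However, the stop words I want removed, "the" and "he", are still not removed.
Does anyone know why, or what I might be doing wrong?
(And yes, I've googled it a lot.)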
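For reference, here is a minimal sketch of how the two list methods differ when given a list. The short `base` list is a hypothetical stand-in for the NLTK list, not its real contents:

```python
# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
base = ["i", "me", "my"]
custom_words = ["the", "he"]

# append adds the whole list as ONE nested element.
appended = list(base)
appended.append(custom_words)

# extend adds each word as its own element.
extended = list(base)
extended.extend(custom_words)

print(appended)           # ['i', 'me', 'my', ['the', 'he']]
print(extended)           # ['i', 'me', 'my', 'the', 'he']
print("the" in appended)  # False: only the nested list is a member
print("the" in extended)  # True
```

So with append, a check like `item not in stopwords` would never match the custom words, while with extend it should.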
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
import re
from IPython.display import display
from sklearn.feature_extraction.text import CountVectorizer
#importing dataset and making a copy as string
data = pd.read_csv('train.csv', encoding="ISO-8859-1")
data1 = data.copy()
data1.text = data1.text.astype(str)
to_drop = ['id',
'title',
'author',]
data1.drop(to_drop, inplace=True, axis=1)
#cleaning text for punctuation, whitespace, splitting, and set to lower
data1['text'] = data1['text'].str.strip().str.lower().str.replace(r'[^\w\s] ', '', regex=True).str.split()
#removing stopwords
stopwords = nltk.corpus.stopwords.words('english')
custom_words = ['the','would','said','one','also','like','could','he']
stopwords.extend(custom_words)
data1['text'] = data1['text'].apply(lambda x: [item for item in x if item not in stopwords ])
data1['text']= data1['text'].apply(lambda x: " ".join(x))
vectorizer = CountVectorizer(max_features=1500, analyzer='word')
train_voc = vectorizer.fit_transform(data1['text'])
sum_words = train_voc.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
print (words_freq[:10])
display(data1.head())
Output:
[('the', 31138), ('people', 28975), ('new', 28495), ('trump', 24752), ('president', 18701), ('he', 17254), ('us', 16969), ('clinton', 16039), ('first', 15520), ('two', 15491)]
text label
0 house dem aidewe didnât even see comeyâs l... 1
1 ever get feeling life circles roundabout rathe... 0
2 truth might get fired october 292016 tension i... 1
3 videos 15 civilians killed single us airstrike... 1
4 print iranian woman sentenced six years prison... 1
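One thing that might be worth checking (a guess, not something confirmed against train.csv): the cleaning pattern only removes a punctuation character when it is followed by a space, so tokens such as `said"the` can survive the list filter and then be split again later. The sketch below uses a made-up sentence and a hypothetical two-word stop list, and imitates the later tokenization with the regular expression that scikit-learn documents as the default `token_pattern` of `CountVectorizer`:

```python
import re

def clean_and_filter(text, stop_words):
    # Same cleaning steps as in the code above: strip, lower, then drop a
    # punctuation character only when it is followed by a space.
    cleaned = re.sub(r'[^\w\s] ', '', text.strip().lower())
    # Whitespace split, then exact-match filtering against the stop list.
    return [tok for tok in cleaned.split() if tok not in stop_words]

stop_words = {'the', 'he'}  # hypothetical subset of the full stop list
sample = 'He said, "the plan works." Then he left.'  # made-up sentence

kept = clean_and_filter(sample, stop_words)
print(kept)  # ['said"the', 'plan', 'works.then', 'left.']

# Default token_pattern of CountVectorizer, applied to the re-joined text.
retokenized = re.findall(r'(?u)\b\w\w+\b', ' '.join(kept))
print(retokenized)  # ['said', 'the', 'plan', 'works', 'then', 'left']
```

In this toy case "the" is filtered out as a standalone token but comes back once `said"the` is split on the quote, which would also be consistent with merged words such as "aidewe" in the output above.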
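Someone asked for an example. Here is one where you can see two outputs: one before removing the stop words and one after. The same applies to my data, except that the dataset is much larger and the output is the most common words. Example output:
Before stop word removal: ['This','is','a','sample','sentence',',','showing','off','the','stop','words','filtering','.']
After stop word removal: ['This','sample','sentence',',','showing','stop','words','filtering','.']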
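The before/after example above can be reproduced with a small sketch. It assumes a tiny hand-written stop word set (`small_stop_words` is hypothetical, not the real NLTK list) so that it runs without downloading the corpus:

```python
# Tokens of the sample sentence, already tokenized as in the example above.
tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
          'off', 'the', 'stop', 'words', 'filtering', '.']

# Hypothetical stand-in for the NLTK English stop word list (all lowercase).
small_stop_words = {'this', 'is', 'a', 'off', 'the'}

# Case-sensitive membership test: 'This' is kept because only 'this' is listed.
filtered = [tok for tok in tokens if tok not in small_stop_words]
print(filtered)
# ['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtering', '.']
```

Note that the comparison is an exact string match, so a token only gets removed if it is identical to an entry in the stop list.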