python - Extract repeated words from a web page and remove stop words

Asked: 2016-05-04 10:39:56

Tags: python dataframe

I am trying to remove stop words from a web page after counting the frequency of the words in it. Here is my code:

import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd
from nltk.corpus import stopwords

stop = stopwords.words('english')
Website1 = requests.get("http://www.nerdwallet.com/the-best-credit-cards")
soup = BeautifulSoup(Website1.content, "html.parser")
texts = soup.findAll(text=True)
a = Counter([x.lower() for y in texts for x in y.split()])
b = (a.most_common())
makeaframe = pd.DataFrame(b)
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)  # -------- 1
makeaframe['Words']=makeaframe['Words'].apply(lambda x: [item for item in x if item not in stop])
print(makeaframe)  # -------- 2

At point 1 I get output that looks fine to me:

     Words  Frequency
0      the        412
1       on        386
2      and        368
3      for        364
4   credit        340
5        a        335
6       to        295
7     card        269

Now, after point 1, I tried to remove the stop words, expecting the following:

     Words  Frequency
4   credit        340
7     card        269

However, I got:

        Words  Frequency
0      [h, e]        412
1         [n]        386
2         [n]        368
3      [f, r]        364
4   [c, r, e]        340
5          []        335
6          []        295
7      [c, r]        269

I guess the lambda function is reading each word character by character and removing the characters that are themselves stop words. I then tried the following variants and could not get past this:

#print makeaframe.ix[:,'Words'].apply(lambda Words: [for Words not in stop])
#print makeaframe.ix[:,'Words'].apply(lambda Words: [item for item in Words if item not in stop])
#makeaframe['Words']=[word for word in makeaframe['Words'] if word not in stop]

I searched the internet for a solution to this but could not find one. Please help.
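For reference, iterating over a string yields its characters, which is exactly what the lambda at point 1 does. A minimal sketch of both the bug and a row-level fix with `Series.isin`, using a hand-made frequency table and a small stand-in stop set (not the scraped data or the full NLTK list):

```python
import pandas as pd

# Stand-in for nltk.corpus.stopwords.words('english'); the real
# list also contains single letters such as 'a', 'i', 's', 't'
stop = {'the', 'on', 'and', 'for', 'a', 'to', 'i', 's', 't'}

# Iterating a string yields characters, so each word collapses
# to a list of its non-stop-word letters:
print([item for item in 'the' if item not in stop])  # ['h', 'e']

# Filtering whole rows instead gives the expected table:
makeaframe = pd.DataFrame(
    [('the', 412), ('on', 386), ('credit', 340), ('card', 269)],
    columns=['Words', 'Frequency'])
filtered = makeaframe[~makeaframe['Words'].isin(stop)]
print(filtered)
```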

2 Answers:

Answer 0 (score: 0)

You should be able to build the list like this:

# stop is the list from the question: stopwords.words('english')
resultwords = [word for word in querywords if word.lower() not in stop]
result = ' '.join(resultwords)

final = Counter(resultwords)
print(final)

From there, I think you can convert to JSON quite easily.
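A self-contained run of this approach, with a hand-made `querywords` list and a stand-in stop set (both are assumptions for illustration, not the scraped page or the NLTK list):

```python
from collections import Counter

# Stand-in for nltk.corpus.stopwords.words('english')
stop = {'the', 'on', 'and', 'for', 'a', 'to'}

# Stand-in for the words scraped from the page
querywords = ['The', 'best', 'credit', 'card', 'for', 'credit']

# Keep only non-stop words, comparing case-insensitively
resultwords = [word for word in querywords if word.lower() not in stop]
result = ' '.join(resultwords)

final = Counter(resultwords)
print(final)
```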

Answer 1 (score: 0)

Try this example; it gives the desired output:

from urllib.request import urlopen  # urllib2 on Python 2
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import pandas as pd
import re

soup = BeautifulSoup(urlopen("http://www.nerdwallet.com/the-best-credit-cards"), "html.parser")

# To remove punctuation and numbers
word = re.sub("[^a-zA-Z]", " ", soup.getText())
# Extract words
words = word_tokenize(word)
# Remove stop words
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words if w not in stop_words]
a = Counter(x.lower() for x in filtered_words)
b = a.most_common()
makeaframe = pd.DataFrame(b)
makeaframe.columns = ['Words', 'Frequency']
makeaframe.head()

Output:

     Words  Frequency
0   credit        389
1     card        332
2  rewards        257
3        i        245
4        e        225
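Rows like 'i' and 'e' survive here because the stop-word check runs before lowercasing, and the regex leaves single-letter fragments behind. A minimal sketch (hand-made token list and stand-in stop set, not the scraped page) that lowercases before filtering and drops one-letter tokens:

```python
from collections import Counter

# Stand-in for set(stopwords.words('english'))
stop_words = {'i', 'the', 'a', 'to'}

words = ['I', 'love', 'the', 'Credit', 'card', 'e']
# Lowercase before the stop-word test so 'I' matches 'i',
# and drop single-letter fragments left over by the regex
filtered = [w.lower() for w in words
            if w.lower() not in stop_words and len(w) > 1]
print(Counter(filtered))
```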