I am trying to remove stop words from a web page after counting the frequency of the words on it. Here is my code:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd
from nltk.corpus import stopwords
stop = stopwords.words('english')
Website1 = requests.get("http://www.nerdwallet.com/the-best-credit-cards")
soup = BeautifulSoup(Website1.content)
texts = soup.findAll(text=True)
a = Counter([x.lower() for y in texts for x in y.split()])
b = (a.most_common())
makeaframe = pd.DataFrame(b)
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)  # point 1
makeaframe['Words']=makeaframe['Words'].apply(lambda x: [item for item in x if item not in stop])
print(makeaframe)  # point 2
At point 1 I get output that looks fine to me:
Words Frequency
0 the 412
1 on 386
2 and 368
3 for 364
4 credit 340
5 a 335
6 to 295
7 card 269
After point 1 I tried to remove the stop words, and I expected the following:
Words Frequency
4 credit 340
7 card 269
However, I got:
Words Frequency
0 [h, e] 412
1 [n] 386
2 [n] 368
3 [f, r] 364
4 [c, r, e] 340
5 [] 335
6 [] 295
7 [c, r] 269
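My guess is that the lambda function is reading each word letter by letter and removing the letters that are stop words. I then tried the following variants, but could not get any of them to work: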
#print makeaframe.ix[:,'Words'].apply(lambda Words: [for Words not in stop])
#print makeaframe.ix[:,'Words'].apply(lambda Words: [item for item in Words if item not in stop])
#makeaframe['Words']=[word for word in makeaframe['Words'] if word not in stop]
I searched the internet for a solution but could not find one. Please help.
Answer 0 (score: 0)
You should be able to build the list like this:
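# querywords and stopwords are assumed to be plain lists of words here
# (for example stopwords = stopwords.words('english')), not the nltk module
resultwords = [word for word in querywords if word.lower() not in stopwords]
result = ' '.join(resultwords)
final = Counter(resultwords)
print(final)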
From there, I think you can convert the result to JSON quite easily.
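To make the cause concrete: `apply` hands the lambda one cell at a time, and each cell in `Words` is a string, so `for item in x` walks over its letters. That would explain `[c, r, e]` for `credit`, since the NLTK English list also contains single letters such as `d`, `i` and `t`. Below is a minimal sketch of filtering at the word level instead; the `stop` set is a small hypothetical stand-in for `stopwords.words('english')` and the sample sentence is made up.

```python
from collections import Counter

# Hypothetical stand-in for stopwords.words('english'); like the NLTK list,
# it contains single-letter entries such as 't', 'o', 'd' and 'i'.
stop = {"the", "on", "and", "for", "a", "to", "t", "o", "d", "i", "s"}

# Iterating over a string yields its characters, so the original lambda
# tests each letter of the word against the stop list.
letters_kept = [item for item in "credit" if item not in stop]

# Filtering has to compare whole words instead.
words = "the credit card on the card and a credit card for credit".split()
counts = Counter(w.lower() for w in words)
filtered = [(word, freq) for word, freq in counts.most_common()
            if word not in stop]

print(letters_kept)
print(filtered)
```

If the frame is already built, a boolean mask such as `makeaframe[~makeaframe['Words'].isin(stop)]` should have the same effect while keeping the original index.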
Answer 1 (score: 0)
Try this example; it should give you the desired output:
from urllib.request import urlopen  # Python 3; on Python 2: from urllib2 import urlopen
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import pandas as pd
import re
soup = BeautifulSoup(urlopen("http://www.nerdwallet.com/the-best-credit-cards"), "html.parser")
# Replace punctuation and digits with spaces
word = re.sub("[^a-zA-Z]", " ", soup.getText())
# Extract the words
words = word_tokenize(word)
# Remove stop words
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words if w not in stop_words]
a = Counter([x.lower() for y in filtered_words for x in y.split()])
b = (a.most_common())
makeaframe = pd.DataFrame(b)
makeaframe.columns = ['Words', 'Frequency']
makeaframe.head()
Output:
Words Frequency
0 credit 389
1 card 332
2 rewards 257
3 i 245
4 e 225
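One caveat about the output above: `i` survives because the stop words are removed before the tokens are lowercased, so a capitalized `I` never matches the lowercase stop list. A minimal sketch of lowercasing first follows. It swaps `word_tokenize` for a plain regular expression, and the `stop_words` set, the `count_words` helper, the `min_length` cutoff and the sample text are all assumptions for illustration, not part of the answer.

```python
import re
from collections import Counter

# Hypothetical stand-in for set(stopwords.words('english')).
stop_words = {"i", "the", "a", "and", "for", "to", "on", "with"}


def count_words(text, stop_words, min_length=2):
    """Count lowercase alphabetic tokens that are not stop words."""
    # Lowercase first so that "I" and "The" match the lowercase stop list.
    tokens = re.findall(r"[a-z]+", text.lower())
    # min_length is an optional extra (an assumption): it drops stray
    # single letters such as "e", which are not stop words themselves.
    return Counter(t for t in tokens
                   if t not in stop_words and len(t) >= min_length)


sample = "I like the rewards card. The card I use has rewards for e-statements."
counts = count_words(sample, stop_words)
print(counts.most_common())
```

The resulting `Counter` can be passed to `pd.DataFrame(counts.most_common(), columns=['Words', 'Frequency'])` in the same way as in the answer.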