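How to scrape multiple websites to find common words (BeautifulSoup, Requests, Python3)

Time: 2014-08-28 21:37:12

Tags: python pandas beautifulsoup

I would like to know how to scrape multiple different websites with Beautiful Soup/Requests without having to repeat my code over and over.

Here is my current code: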

import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

website1 = requests.get("http://www.nerdwallet.com/the-best-credit-cards")
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(website1.content, "html.parser")
texts = soup.find_all(string=True)
# Count every whitespace-separated word, lowercased.
a = Counter([x.lower() for y in texts for x in y.split()])
b = a.most_common()
makeaframe = pd.DataFrame(b)
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)

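Ideally, I want to scrape 5 different websites, find all the words on each site, compute the frequency of each word on each site, add up the frequencies of each particular word across all sites, and then combine all of this data into a single DataFrame that can be exported with Pandas.

Ideally the output would look like this: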

Word     Frequency
the       200
man       300
is        400
tired     300

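My code currently does this for only one website at a time, and I am trying to avoid repeating my code.

Right now I can do this manually by repeating my code over and over, scraping each website and then concatenating the resulting DataFrames, but that seems very unpythonic. Does anyone have a faster way or any suggestions? Thanks!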

2 Answers:

Answer 0: (score: 2)

Create a function:

import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

cnt = Counter()  # running word totals across all sites

def get_data(url):
    # Fetch one page and add its word counts to the shared counter.
    website = requests.get(url)
    soup = BeautifulSoup(website.content, "html.parser")
    texts = soup.find_all(string=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    # Pass the Counter itself so update() adds the counts per word.
    # Passing a.most_common() would count each (word, count) tuple as a key.
    cnt.update(a)

websites = ['http://www.nerdwallet.com/the-best-credit-cards', 'http://www.other.com']
for url in websites:
    get_data(url)

makeaframe = pd.DataFrame(cnt.most_common())
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)
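
For the per-site part of the question, one possible sketch keeps a separate Counter for each site and lets pandas line them up in columns. To stay runnable without network access, it works on HTML strings instead of live requests. The helper name count_words, the pages dict, and its sample HTML are hypothetical, and skipping script/style text is an added assumption rather than something the answer above does.

```python
from collections import Counter

import pandas as pd
from bs4 import BeautifulSoup


def count_words(html):
    """Return a Counter of lowercased words in the text of an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: script and style contents are not wanted as "words".
    for tag in soup(["script", "style"]):
        tag.decompose()
    return Counter(word.lower() for word in soup.get_text(" ").split())


# Hypothetical sample pages; in practice each value could be requests.get(url).text.
pages = {
    "site_a": "<html><body><p>The man is tired</p><script>var x = 1;</script></body></html>",
    "site_b": "<html><body><h1>The Man</h1><p>the end</p></body></html>",
}

# One plain dict of counts per site.
per_site = {name: dict(count_words(html)) for name, html in pages.items()}

# One column per site, one row per word; words missing from a site become 0.
frame = pd.DataFrame(per_site).fillna(0).astype(int)
frame["Total"] = frame.sum(axis=1)
frame = frame.sort_values("Total", ascending=False)
frame.index.name = "Word"
print(frame)
```

From here, frame.to_csv("word_counts.csv") (file name chosen for illustration) would export the combined table.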

Answer 1: (score: 1)

Just loop and update a main Counter dict:

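import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

main_c = Counter()  # keep all results here
urls = ["http://www.nerdwallet.com/the-best-credit-cards", "http://stackoverflow.com/questions/tagged/python"]
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content, "html.parser")
    texts = soup.find_all(string=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    # Update with the Counter itself so counts for the same word are summed.
    main_c.update(a)
make_a_frame = pd.DataFrame(main_c.most_common())
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame)

Unlike the ordinary dict.update method, Counter.update adds to the existing counts instead of replacing them.

On a style note, use lowercase with underscores for variable names, as in make_a_frame.

Alternatively, build the rows from the counter's items and sort by frequency:

comm = [[k, v] for k, v in main_c.items()]
make_a_frame = pd.DataFrame(comm)
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame.sort_values("Frequency", ascending=False))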
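
The difference between Counter.update and dict.update can be checked with a small standard-library sketch, which also shows what happens when a most_common() list is passed instead of the Counter itself. The sample counts are made up for illustration.

```python
from collections import Counter

site_a = Counter({"the": 2, "man": 1})
site_b = Counter({"the": 3, "tired": 1})

# Counter.update with a mapping adds the counts for matching keys.
total = Counter()
total.update(site_a)
total.update(site_b)

# dict.update replaces the value for a matching key instead.
plain = dict(site_a)
plain.update(site_b)

# most_common() returns a list of (word, count) tuples, so update()
# counts each tuple once as its own key rather than summing per word.
wrong = Counter()
wrong.update(site_a.most_common())

print(total["the"], plain["the"], wrong[("the", 2)])  # prints: 5 3 1
```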