I want to know how to scrape several different websites with BeautifulSoup/requests without having to repeat my code over and over.
Here is my current code:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd
Website1 = requests.get("http://www.nerdwallet.com/the-best-credit-cards")
soup = BeautifulSoup(Website1.content, "html.parser")  # specify a parser explicitly to avoid the default-parser warning
texts = soup.findAll(text=True)  # every text node on the page
a = Counter([x.lower() for y in texts for x in y.split()])  # word -> frequency
b = a.most_common()
makeaframe = pd.DataFrame(b)
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)
What I would ideally like to do is scrape five different websites, find all the words on each site, get the frequency of each word on each site, add the frequencies together for each word across all the sites, and then combine everything into a single DataFrame that can be exported with Pandas.
Ideally the output would look something like this:
Word Frequency
the 200
man 300
is 400
tired 300
My code currently only does this for one website, and I am trying to avoid repeating it.
Right now I could do this manually by repeating my code over and over, scraping each website and then concatenating the resulting DataFrames, but that seems very repetitive and unpythonic. I was wondering if anyone has a faster way or any suggestions? Thanks!
Answer 0 (score: 2)
Create a function:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

cnt = Counter()  # running total across all sites

def GetData(url):
    website = requests.get(url)
    soup = BeautifulSoup(website.content, "html.parser")
    texts = soup.findAll(text=True)  # every text node on the page
    a = Counter([x.lower() for y in texts for x in y.split()])
    cnt.update(a)  # pass the Counter itself; update(a.most_common()) would count the (word, count) tuples instead of adding the counts

websites = ['http://www.nerdwallet.com/the-best-credit-cards', 'http://www.other.com']

for url in websites:
    GetData(url)

makeaframe = pd.DataFrame(cnt.most_common())
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)
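If you would rather not mutate a module-level counter, roughly the same idea can be written so the function returns a Counter and the caller accumulates the results. This is just a sketch; get_word_counts is a made-up name, and it assumes the same websites list as above:

def get_word_counts(url):
    # hypothetical helper: return a Counter of lowercased words for one page
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    texts = soup.findAll(text=True)
    return Counter(x.lower() for y in texts for x in y.split())

total = Counter()
for url in websites:
    total += get_word_counts(url)  # Counter addition sums the per-site frequencies

frame = pd.DataFrame(total.most_common(), columns=['Words', 'Frequency'])
print(frame)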
Answer 1 (score: 1)
Just loop over the URLs and update a master Counter:
main_c = Counter()  # keep all results here
urls = ["http://www.nerdwallet.com/the-best-credit-cards", "http://stackoverflow.com/questions/tagged/python"]

for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content, "html.parser")
    texts = soup.findAll(text=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    main_c.update(a)  # pass the Counter itself so the counts are summed across sites

make_a_frame = pd.DataFrame(main_c.most_common())
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame)
Unlike a plain dict.update, Counter's update method adds to the existing counts rather than replacing them.
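A quick standalone illustration of that difference (not part of the original answer):

from collections import Counter

d = {"the": 2}
d.update({"the": 3})  # plain dict: the old value is replaced
print(d)              # {'the': 3}

c = Counter({"the": 2})
c.update(Counter({"the": 3}))  # Counter: the counts are added together
print(c)              # Counter({'the': 5})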
On a style note, use lowercase names with underscores for variables, e.g. make_a_frame.
To sort the result by frequency, try:
comm = [[k, v] for k, v in main_c.items()]  # iterate items(); looping over the Counter itself yields only the keys
make_a_frame = pd.DataFrame(comm)
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame.sort_values("Frequency", ascending=False))
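Since the question also mentions exporting the combined data with Pandas, once make_a_frame exists it can be written out in one line (the filename here is just an example):

make_a_frame.sort_values("Frequency", ascending=False).to_csv("word_counts.csv", index=False)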