How can I collect the links behind the "View More Campaigns" button using Python 3? I would like to collect all 260604 links from this page: https://www.gofundme.com/mvc.php?route=category&term=sport
Answer 0 (score: 2)
After clicking the View More Campaigns button, the browser requests the following URL:
https://www.gofundme.com/mvc.php?route=category/loadMoreTiles&page=2&term=sport&country=GB&initialTerm=
This URL can be used to request further pages, as follows:
from bs4 import BeautifulSoup
import requests

page = 1
links = set()
length = 0

while True:
    print("Page {}".format(page))
    gofundme = requests.get('https://www.gofundme.com/mvc.php?route=category/loadMoreTiles&page={}&term=sport&country=GB&initialTerm='.format(page))
    soup = BeautifulSoup(gofundme.content, "html.parser")
    links.update([a['href'] for a in soup.find_all('a', href=True)])

    # Stop when no new links are found
    if len(links) == length:
        break

    length = len(links)
    page += 1

for link in sorted(links):
    print(link)
This gives you output of the form:
https://www.gofundme.com/100-round-kumite-rundraiser
https://www.gofundme.com/10k-challenge-for-disabled-sports
https://www.gofundme.com/1yeti0
https://www.gofundme.com/2-marathons-1-month
https://www.gofundme.com/23yq67t4
https://www.gofundme.com/2fwyuwvg
Some of the returned links are duplicates, so a set is used to avoid counting them twice.
The script keeps requesting new pages until no new links are seen, which appears to happen after about 18 pages.
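If you prefer not to format the page number into the URL string, requests can also build the query string from a dict. This is only a minimal sketch, assuming the endpoint still accepts the same parameters shown in the URL above; fetch_page is an illustrative helper name, not something from the original answer:

from bs4 import BeautifulSoup
import requests

BASE_URL = 'https://www.gofundme.com/mvc.php'

def fetch_page(page):
    # Same query parameters as the loadMoreTiles URL above,
    # passed as a dict so requests handles the URL encoding.
    params = {
        'route': 'category/loadMoreTiles',
        'page': page,
        'term': 'sport',
        'country': 'GB',
        'initialTerm': '',
    }
    response = requests.get(BASE_URL, params=params)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Return every href found on this page of results.
    return {a['href'] for a in soup.find_all('a', href=True)}

print(sorted(fetch_page(1)))

The stopping logic (keep paging until the set of links stops growing) would stay the same as in the loop above.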
Answer 1 (score: 1)
From retrieve links from web page using python and BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('https://www.gofundme.com/mvc.php?route=category&term=sport')

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
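Note that parseOnlyThese is the old BS3-style keyword; depending on the installed beautifulsoup4 version it may only be accepted with a deprecation warning, as it has been renamed to parse_only. A roughly equivalent sketch using requests instead of httplib2 and an explicit parser, with the URL taken from the question (whether that page still serves the links is an assumption):

import requests
from bs4 import BeautifulSoup, SoupStrainer

response = requests.get('https://www.gofundme.com/mvc.php?route=category&term=sport')

# Only parse <a> tags, mirroring the SoupStrainer usage above.
soup = BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a'))

for link in soup.find_all('a', href=True):
    print(link['href'])

Keep in mind that this only fetches the first page; to approach all 260604 links you still need the paginated loadMoreTiles request shown in the first answer.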