How do I collect all links from a web page?

Posted: 2017-11-22 19:57:12

Tags: python python-3.x web-scraping beautifulsoup

How can I collect the links behind the "View More Campaigns" button using Python 3? I want to collect all 260,604 links from this page: https://www.gofundme.com/mvc.php?route=category&term=sport

2 answers:

Answer 0: (score: 2)

After clicking the View More Campaigns button, the browser requests the following URL:

https://www.gofundme.com/mvc.php?route=category/loadMoreTiles&page=2&term=sport&country=GB&initialTerm=

This endpoint can be used to request further pages, as follows:

from bs4 import BeautifulSoup
import requests

page = 1
links = set()   # a set deduplicates links that appear on more than one page
length = 0

while True:
    print("Page {}".format(page))
    gofundme = requests.get(
        'https://www.gofundme.com/mvc.php?route=category/loadMoreTiles'
        '&page={}&term=sport&country=GB&initialTerm='.format(page))
    soup = BeautifulSoup(gofundme.content, "html.parser")
    # Collect the href of every <a> tag in the returned tile markup
    links.update([a['href'] for a in soup.find_all('a', href=True)])

    # Stop when a page contributes no links we haven't already seen
    if len(links) == length:
        break

    length = len(links)
    page += 1

for link in sorted(links):
    print(link)

This gives you output like the following:

https://www.gofundme.com/100-round-kumite-rundraiser
https://www.gofundme.com/10k-challenge-for-disabled-sports
https://www.gofundme.com/1yeti0
https://www.gofundme.com/2-marathons-1-month
https://www.gofundme.com/23yq67t4
https://www.gofundme.com/2fwyuwvg

Some of the returned links are duplicates, so a set is used to avoid storing them twice. The script keeps requesting new pages until no new links are seen, which appears to happen at around 18 pages.
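The endpoint may also return links that are not campaign pages (navigation, social links, and so on). If you only want the campaign links themselves, a small filter can be applied to the links set built by the script above. This is a minimal sketch, assuming campaign URLs are a single path segment directly under www.gofundme.com; the is_campaign_link helper is hypothetical and not part of the original answer:

from urllib.parse import urlparse

def is_campaign_link(href):
    # Heuristic (assumption): campaign pages look like
    # https://www.gofundme.com/<one-segment-slug>
    parsed = urlparse(href)
    if parsed.netloc != 'www.gofundme.com':
        return False
    path = parsed.path.strip('/')
    # Exactly one path segment, and not the mvc.php endpoint itself
    return path != '' and '/' not in path and path != 'mvc.php'

campaign_links = {link for link in links if is_campaign_link(link)}
print("{} campaign links".format(len(campaign_links)))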

Answer 1: (score: 1)

From retrieve links from web page using python and BeautifulSoup:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
# httplib2 returns the response headers and the body as a pair
headers, content = http.request('https://www.gofundme.com/mvc.php?route=category&term=sport')

# parse_only (parseOnlyThese is the deprecated BS3 spelling) restricts
# the parse to <a> tags only
for link in BeautifulSoup(content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
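The two answers combine naturally: the SoupStrainer restriction can be dropped into the pagination loop from answer 0, so that nothing but the anchors is parsed on each page. A minimal sketch, reusing the loadMoreTiles endpoint and stop condition from above (requests replaces httplib2 here, since the first answer already uses it):

import requests
from bs4 import BeautifulSoup, SoupStrainer

links = set()
page, length = 1, 0
only_anchors = SoupStrainer('a')   # parse nothing but <a> tags

while True:
    resp = requests.get(
        'https://www.gofundme.com/mvc.php?route=category/loadMoreTiles'
        '&page={}&term=sport&country=GB&initialTerm='.format(page))
    soup = BeautifulSoup(resp.content, 'html.parser', parse_only=only_anchors)
    links.update(a['href'] for a in soup.find_all('a', href=True))
    if len(links) == length:   # no new links means we've reached the end
        break
    length = len(links)
    page += 1

for link in sorted(links):
    print(link)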