How do I collect all links from a web page?

Posted: 2017-11-22 19:57:12

Tags: python python-3.x web-scraping beautifulsoup

How can I collect the links behind the "View More Campaigns" button using Python 3? I want to collect all 260,604 links from this page: https://www.gofundme.com/mvc.php?route=category&term=sport

2 answers:

Answer 0: (score: 2)

After clicking the View More Campaigns button, the browser requests the following URL:

https://www.gofundme.com/mvc.php?route=category/loadMoreTiles&page=2&term=sport&country=GB&initialTerm=

This endpoint can be used to request further pages, as follows:

from bs4 import BeautifulSoup
import requests

page = 1
links = set()   # a set deduplicates links that appear on more than one page
length = 0

while True:
    print("Page {}".format(page))
    gofundme = requests.get(
        'https://www.gofundme.com/mvc.php?route=category/loadMoreTiles'
        '&page={}&term=sport&country=GB&initialTerm='.format(page))
    soup = BeautifulSoup(gofundme.content, "html.parser")
    # Collect the href of every <a> tag in the returned tile markup
    links.update([a['href'] for a in soup.find_all('a', href=True)])

    # Stop when a page contributes no links we haven't already seen
    if len(links) == length:
        break

    length = len(links)
    page += 1

for link in sorted(links):
    print(link)

This gives you output like the following:

https://www.gofundme.com/100-round-kumite-rundraiser
https://www.gofundme.com/10k-challenge-for-disabled-sports
https://www.gofundme.com/1yeti0
https://www.gofundme.com/2-marathons-1-month
https://www.gofundme.com/23yq67t4
https://www.gofundme.com/2fwyuwvg

Some of the returned links are duplicates, so a set is used to avoid storing them twice. The script keeps requesting new pages until no new links are seen, which appears to happen at around 18 pages.
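The endpoint may also return links that are not campaign pages (navigation, social links, and so on). If you only want the campaign links themselves, a small filter can be applied to the links set built by the script above. This is a minimal sketch, assuming campaign URLs are a single path segment directly under www.gofundme.com; the is_campaign_link helper is hypothetical and not part of the original answer:

from urllib.parse import urlparse

def is_campaign_link(href):
    # Heuristic (assumption): campaign pages look like
    # https://www.gofundme.com/<one-segment-slug>
    parsed = urlparse(href)
    if parsed.netloc != 'www.gofundme.com':
        return False
    path = parsed.path.strip('/')
    # Exactly one path segment, and not the mvc.php endpoint itself
    return path != '' and '/' not in path and path != 'mvc.php'

campaign_links = {link for link in links if is_campaign_link(link)}
print("{} campaign links".format(len(campaign_links)))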

Answer 1: (score: 1)

From retrieve links from web page using python and BeautifulSoup:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
# httplib2 returns the response headers and the body as a pair
headers, content = http.request('https://www.gofundme.com/mvc.php?route=category&term=sport')

# parse_only (parseOnlyThese is the deprecated BS3 spelling) restricts
# the parse to <a> tags only
for link in BeautifulSoup(content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
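The two answers combine naturally: the SoupStrainer restriction can be dropped into the pagination loop from answer 0, so that nothing but the anchors is parsed on each page. A minimal sketch, reusing the loadMoreTiles endpoint and stop condition from above (requests replaces httplib2 here, since the first answer already uses it):

import requests
from bs4 import BeautifulSoup, SoupStrainer

links = set()
page, length = 1, 0
only_anchors = SoupStrainer('a')   # parse nothing but <a> tags

while True:
    resp = requests.get(
        'https://www.gofundme.com/mvc.php?route=category/loadMoreTiles'
        '&page={}&term=sport&country=GB&initialTerm='.format(page))
    soup = BeautifulSoup(resp.content, 'html.parser', parse_only=only_anchors)
    links.update(a['href'] for a in soup.find_all('a', href=True))
    if len(links) == length:   # no new links means we've reached the end
        break
    length = len(links)
    page += 1

for link in sorted(links):
    print(link)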