How do I find the full list of URL paths within a website to scrape?

Time: 2020-07-27 08:49:27

Tags: web-scraping beautifulsoup

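Is it possible to use Python to view the full list of URL paths for a website I want to scrape?

The URL structure stays the same; only the path segments change:

https://www.broadsheet.com.au/{city}/guides/best-cafes-{area}

Right now I have a function that lets me define {city} and {area} with an f-string literal, but I have to do this manually. For example: city = melbourne, area = fitzroy

I would like the function to iterate over all available paths, but first I need to work out how to get the full list of paths.

Can a scraper do this?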

2 answers:

Answer 0: (score: 1)

You can parse the sitemap to get the URLs you need, for example:

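import requests
from bs4 import BeautifulSoup


# The top-level sitemap lists one sub-sitemap per city and section.
url = 'https://www.broadsheet.com.au/sitemap'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for loc in soup.select('loc'):
    # Only follow the sub-sitemaps whose URL ends with '/guide'.
    sub_url = loc.text.strip()
    if not sub_url.endswith('/guide'):
        continue
    soup2 = BeautifulSoup(requests.get(sub_url).content, 'html.parser')
    for loc2 in soup2.select('loc'):
        # Keep only the "best cafes" guide pages.
        if '/best-cafes-' in loc2.text:
            print(loc2.text.strip())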

Prints:

https://www.broadsheet.com.au/melbourne/guides/best-cafes-st-kilda
https://www.broadsheet.com.au/melbourne/guides/best-cafes-fitzroy
https://www.broadsheet.com.au/melbourne/guides/best-cafes-balaclava
https://www.broadsheet.com.au/melbourne/guides/best-cafes-preston
https://www.broadsheet.com.au/melbourne/guides/best-cafes-seddon
https://www.broadsheet.com.au/melbourne/guides/best-cafes-northcote
https://www.broadsheet.com.au/melbourne/guides/best-cafes-fairfield
https://www.broadsheet.com.au/melbourne/guides/best-cafes-ascot-vale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-west-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-flemington
https://www.broadsheet.com.au/melbourne/guides/best-cafes-windsor
https://www.broadsheet.com.au/melbourne/guides/best-cafes-kensington
https://www.broadsheet.com.au/melbourne/guides/best-cafes-prahran
https://www.broadsheet.com.au/melbourne/guides/best-cafes-essendon
https://www.broadsheet.com.au/melbourne/guides/best-cafes-pascoe-vale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-albert-park
https://www.broadsheet.com.au/melbourne/guides/best-cafes-port-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-armadale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brighton
https://www.broadsheet.com.au/melbourne/guides/best-cafes-malvern
https://www.broadsheet.com.au/melbourne/guides/best-cafes-malvern-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-glen-iris
https://www.broadsheet.com.au/melbourne/guides/best-cafes-camberwell
https://www.broadsheet.com.au/melbourne/guides/best-cafes-hawthorn-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brunswick-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-bentleigh
https://www.broadsheet.com.au/melbourne/guides/best-cafes-coburg
https://www.broadsheet.com.au/melbourne/guides/best-cafes-richmond
https://www.broadsheet.com.au/melbourne/guides/best-cafes-bentleigh-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-collingwood
https://www.broadsheet.com.au/melbourne/guides/best-cafes-elwood
https://www.broadsheet.com.au/melbourne/guides/best-cafes-abbotsford
https://www.broadsheet.com.au/melbourne/guides/best-cafes-south-yarra
https://www.broadsheet.com.au/melbourne/guides/best-cafes-yarraville
https://www.broadsheet.com.au/melbourne/guides/best-cafes-thornbury
https://www.broadsheet.com.au/melbourne/guides/best-cafes-west-footscray
https://www.broadsheet.com.au/melbourne/guides/best-cafes-footscray
https://www.broadsheet.com.au/melbourne/guides/best-cafes-south-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-hawthorn
https://www.broadsheet.com.au/melbourne/guides/best-cafes-carlton-north
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brunswick
https://www.broadsheet.com.au/melbourne/guides/best-cafes-carlton
https://www.broadsheet.com.au/melbourne/guides/best-cafes-elsternwick
https://www.broadsheet.com.au/sydney/guides/best-cafes-bronte
https://www.broadsheet.com.au/sydney/guides/best-cafes-coogee
https://www.broadsheet.com.au/sydney/guides/best-cafes-rosebery
https://www.broadsheet.com.au/sydney/guides/best-cafes-ultimo
https://www.broadsheet.com.au/sydney/guides/best-cafes-enmore
https://www.broadsheet.com.au/sydney/guides/best-cafes-dulwich-hill
https://www.broadsheet.com.au/sydney/guides/best-cafes-leichhardt
https://www.broadsheet.com.au/sydney/guides/best-cafes-glebe
https://www.broadsheet.com.au/sydney/guides/best-cafes-annandale
https://www.broadsheet.com.au/sydney/guides/best-cafes-rozelle
https://www.broadsheet.com.au/sydney/guides/best-cafes-paddington
https://www.broadsheet.com.au/sydney/guides/best-cafes-balmain
https://www.broadsheet.com.au/sydney/guides/best-cafes-erskineville
https://www.broadsheet.com.au/sydney/guides/best-cafes-willoughby
https://www.broadsheet.com.au/sydney/guides/best-cafes-bondi-junction
https://www.broadsheet.com.au/sydney/guides/best-cafes-north-sydney
https://www.broadsheet.com.au/sydney/guides/best-cafes-bondi
https://www.broadsheet.com.au/sydney/guides/best-cafes-potts-point
https://www.broadsheet.com.au/sydney/guides/best-cafes-mosman
https://www.broadsheet.com.au/sydney/guides/best-cafes-alexandria
https://www.broadsheet.com.au/sydney/guides/best-cafes-crows-nest
https://www.broadsheet.com.au/sydney/guides/best-cafes-manly
https://www.broadsheet.com.au/sydney/guides/best-cafes-woolloomooloo
https://www.broadsheet.com.au/sydney/guides/best-cafes-newtown
https://www.broadsheet.com.au/sydney/guides/best-cafes-vaucluse
https://www.broadsheet.com.au/sydney/guides/best-cafes-chippendale
https://www.broadsheet.com.au/sydney/guides/best-cafes-marrickville
https://www.broadsheet.com.au/sydney/guides/best-cafes-redfern
https://www.broadsheet.com.au/sydney/guides/best-cafes-camperdown
https://www.broadsheet.com.au/sydney/guides/best-cafes-darlinghurst
https://www.broadsheet.com.au/adelaide/guides/best-cafes-goodwood
https://www.broadsheet.com.au/perth/guides/best-cafes-northbridge
https://www.broadsheet.com.au/perth/guides/best-cafes-leederville
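
To feed these URLs back into the {city} / {area} template from the question, the two path segments can be pulled out of each URL. The following is a minimal sketch, assuming every URL follows the /{city}/guides/best-cafes-{area} pattern seen in the output above; split_guide_url is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urlparse

PREFIX = 'best-cafes-'


def split_guide_url(url):
    """Return (city, area) for a best-cafes guide URL, or None if it does not match."""
    parts = urlparse(url).path.strip('/').split('/')
    # Expected path shape: {city}/guides/best-cafes-{area}
    if len(parts) != 3 or parts[1] != 'guides' or not parts[2].startswith(PREFIX):
        return None
    return parts[0], parts[2][len(PREFIX):]


urls = [
    'https://www.broadsheet.com.au/melbourne/guides/best-cafes-fitzroy',
    'https://www.broadsheet.com.au/sydney/guides/best-cafes-bondi',
]
pairs = [split_guide_url(u) for u in urls]
print(pairs)  # [('melbourne', 'fitzroy'), ('sydney', 'bondi')]
```

URLs that do not fit the pattern come back as None, so they can be filtered out before looping over the pairs.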

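Answer 1: (score: 0)

Essentially, you are trying to build a spider, the way a search engine does. So why not use an existing one? Up to 100 queries per day are free. You will need to set up Google Custom Search and define a search query.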

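  1. Get your API key here: https://developers.google.com/custom-search/v1/introduction/?apix=true
  2. Define a new search engine at https://cse.google.com/cse/all using the URL https://www.broadsheet.com.au/
  3. Click the public URL and copy the part after cx=, which looks like 123456:abcdef
  4. Put your API key and the cx part into the URL named google below
  5. Adjust the query below to get results for different cities. I set it up to find results for Melbourne, but you could easily use a placeholder there and format the string.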

import requests

# Fill in your own API key and search engine ID (the cx value).
API_KEY = 'your_custom_search_key'
CX = 'your_custom_search_id'

google = 'https://www.googleapis.com/customsearch/v1?key={key}&cx={cx}&q=site:https://www.broadsheet.com.au/melbourne/guides/best+%22best+cafes+in%22+%22melbourne%22&start={start}'

results = []
with requests.Session() as session:
    start = 1
    while True:
        result = session.get(google.format(key=API_KEY, cx=CX, start=start)).json()
        # Store this page first so the last page is not dropped.
        results += result.get('items', [])
        if 'nextPage' not in result.get('queries', {}):
            break
        start = result['queries']['nextPage'][0]['startIndex']
        print(start)
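
Step 5 mentions swapping the hard-coded city for a placeholder. Here is a rough sketch of that idea, assuming the same query shape as the Melbourne query above. build_search_url and extract_links are hypothetical helper names, and the key and engine ID shown are dummy values. The link field is where the Custom Search JSON API returns each result's URL.

```python
from urllib.parse import urlencode

ENDPOINT = 'https://www.googleapis.com/customsearch/v1'


def build_search_url(city, key, cx, start=1):
    """Build a Custom Search request URL for one city (hypothetical helper)."""
    query = f'site:https://www.broadsheet.com.au/{city}/guides/best "best cafes in" "{city}"'
    params = {'key': key, 'cx': cx, 'q': query, 'start': start}
    # urlencode percent-encodes the query, so no manual %22 or + is needed.
    return f'{ENDPOINT}?{urlencode(params)}'


def extract_links(results):
    """Pull the result URLs out of the collected items entries."""
    return [item['link'] for item in results if 'link' in item]


url = build_search_url('sydney', key='DUMMY_KEY', cx='DUMMY_CX')
print(url)
```

Calling build_search_url once per city and passing the collected results list to extract_links should give the guide URLs, subject to whatever Google has indexed for the site.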