Question

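I'm doing some web scraping in Python. I'm trying to collect all the links found in the href attributes and then visit them one by one to scrape data from each page. I'm a beginner and can't figure out how to proceed. Here is my code: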

    import requests
    import urllib.request
    import re
    from bs4 import BeautifulSoup
    import csv

    url = 'https://menupages.com/restaurants/ny-new-york'
    url1 = 'https://menupages.com'
    response = requests.get(url)
    f = csv.writer(open('Restuarants_details.csv', 'w'))

    soup = BeautifulSoup(response.text, "html.parser")

    menu_sections = []
    # Each restaurant title is an <h3> wrapping an <a> with a relative href.
    for url2 in soup.find_all('h3', class_='restaurant__title'):
        completeurl = url1 + url2.a.get('href')
        print(completeurl)

        #print(url)

Answer 1

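If you want to scrape all the links found on the first page, then all the links found on those pages, and so on, you need a recursive function.

Here is some starter code: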

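import requests
from bs4 import BeautifulSoup

def scrape(url):
    print("now looking at " + url)
    # scrape URL
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # do something with the data

    if (STOP_CONDITION):  # update this!
        return

    # scrape new URLs:
    # (find_all returns tags; pull the full URL out of each one before recursing)
    for new_url in soup.find_all(...):
        scrape(new_url)

# scrape() must be defined before this block runs
if __name__ == "__main__":
    initial_url = "https://menupages.com/restaurants/ny-new-york"
    scrape(initial_url)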

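The problem with this recursive function is that it won't stop until it reaches pages with no links on them, which may not happen any time soon. You will need to add a stop condition.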
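
One possible stop condition, as a minimal sketch rather than a finished crawler: cap how many links deep the recursion may go, and remember which URLs were already visited so that pages linking back to each other do not loop forever. The names `crawl`, `get_links`, `max_depth` and `get_restaurant_links` are made up for this illustration. The extractor assumes the `restaurant__title` selector from the question still matches the site's markup.

```python
from urllib.parse import urljoin


def crawl(start_url, get_links, max_depth=1):
    """Visit start_url and follow links, never more than max_depth links deep.

    get_links is any callable that takes a URL and returns the href values
    found on that page. Returns the URLs in the order they were visited.
    """
    visited = []
    seen = set()

    def visit(url, depth):
        # Stop conditions: page already handled, or too far from the start.
        if url in seen or depth > max_depth:
            return
        seen.add(url)
        visited.append(url)
        print("now looking at " + url)
        for href in get_links(url):
            # Resolve relative links such as "/restaurants/..." against the page URL.
            visit(urljoin(url, href), depth + 1)

    visit(start_url, 0)
    return visited


def get_restaurant_links(url):
    """Hypothetical extractor that reuses the selector from the question."""
    # Third-party imports are kept local so crawl() itself needs only the standard library.
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    titles = soup.find_all("h3", class_="restaurant__title")
    return [h3.a.get("href") for h3 in titles if h3.a]


# Example call (performs real HTTP requests):
# crawl("https://menupages.com/restaurants/ny-new-york", get_restaurant_links)
```

Passing the link extractor in as an argument keeps the stopping logic separate from the page-specific parsing, so the same `crawl` can be tried against a small fake set of pages before it is pointed at the real site.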
