Question

I am looking for a way to scrape data from a website.

https://www.industrynet.com/companies/

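I plan to get the name and location of every company listed on this site. I think I need to loop through each page somehow, but I'm not sure how to do that when the data I need sits on another, nested page.

So far I can only scrape a single page comfortably, so any help would be appreciated.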

Answer 1

You can think of your scraping process as a tree whose branches are the pages you follow from each level. In rough pseudocode, it looks like this:

    company_details = {}
    request the landing page and parse it
    for letter_href in landing_page:
        request the page at letter_href and parse it (the company_code page)
        company_code = some_code_you_scraped
        for company_href in company_code_page:
            request the company page at company_href and parse it
            add the company's info to company_details, along with
            the company_code you grabbed from the previous page
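
The pseudocode above could be sketched in Python roughly as follows, using only the standard library. The `/letter/` and `/company/` href markers, the `get_page` and `parse_company` parameters, and the helper names are all assumptions made for illustration; the real link patterns on the site have to be found by inspecting its HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def links_on(url, get_page, marker):
    """Return (absolute URL, text) for links on `url` whose href contains `marker`."""
    parser = LinkCollector()
    parser.feed(get_page(url))
    return [(urljoin(url, href), text) for href, text in parser.links if marker in href]


def crawl(start_url, get_page=fetch, parse_company=None,
          letter_marker="/letter/", company_marker="/company/"):
    """Walk landing page -> letter pages -> company links, one nested loop per level.

    The marker defaults are hypothetical placeholders, not the site's real URL scheme.
    """
    company_details = {}
    for letter_url, letter in links_on(start_url, get_page, letter_marker):
        for company_url, name in links_on(letter_url, get_page, company_marker):
            details = {"name": name, "letter": letter}
            if parse_company is not None:
                # Optional third level: fetch the company page itself and let the
                # caller-supplied parser extract fields such as the location.
                details.update(parse_company(get_page(company_url)))
            company_details[company_url] = details
    return company_details
```

Calling `crawl("https://www.industrynet.com/companies/")` would only return results once the two markers match the site's actual links. Passing `get_page` as a parameter keeps the traversal testable against saved HTML, and it is also a convenient place to add a delay between requests.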

Hope this helps!

How do I loop through nested web pages when web scraping?
