Question

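I am trying to build a web scraper that scrapes a given number of pages, but it only scrapes the first page and prints it as many times as the number of pages I want to scrape.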

import requests
from bs4 import BeautifulSoup


def web_spider(max_pages):
    company_list = []  # collected company slugs (was undefined in the original snippet)
    page = 1
    while page <= max_pages:
        # Build the list URL for the current page number
        url = ('http://www.forbes.com/global2000/list/#page:' + str(page) +
               '_sort:0_direction:asc_search:_filter:All%20industries_'
               'filter:All%20countries_filter:All%20states')
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        # Keep only links that sit directly inside a table cell
        for link in soup.find_all('a'):
            if link.parent.name == 'td':
                href = link.get('href')
                x = href[11:len(href)-1]
                company_list.append(x)
        page += 1
    print(page)
    return company_list

Edit: I solved it another way.

Answer 1

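If you need the dataset, you can use the browser's developer tools to find out which network resources the page uses: start recording network traffic, then refresh the page and watch how the table gets populated. In this case, I found the following URL: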

https://www.forbes.com/forbesapi/org/global2000/2020/position/true.json?limit=2000
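
As a rough sketch of how that endpoint could be consumed from Python: the snippet below downloads the JSON using only the standard library (`requests.get(url).json()` would work just as well) and pulls out the first list of records it finds. The helper names `find_records` and `fetch_global2000` are made up for illustration. The exact shape of the response is not shown here, so the helper searches the nested structure instead of relying on specific key names, and the browser-like `User-Agent` header is an assumption about what the server accepts.

```python
import json
import urllib.request

API_URL = ("https://www.forbes.com/forbesapi/org/global2000/2020/"
           "position/true.json?limit=2000")


def find_records(payload):
    """Return the first non-empty list of dicts found in nested JSON, or []."""
    if isinstance(payload, list):
        if payload and all(isinstance(item, dict) for item in payload):
            return payload
        children = payload
    elif isinstance(payload, dict):
        children = payload.values()
    else:
        return []
    # Depth-first search through nested containers
    for child in children:
        found = find_records(child)
        if found:
            return found
    return []


def fetch_global2000(url=API_URL, timeout=30):
    """Download the JSON document and return its list of records."""
    # Assumption: a browser-like User-Agent avoids the request being rejected.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return find_records(payload)
```

Calling `fetch_global2000()` performs a single request and returns the records as a list of dictionaries; printing the keys of the first record is a quick way to see which fields are available.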

Does this help?
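
For context on why the original loop kept returning the first page: everything after `#` in a URL is a fragment, which the browser keeps for client-side scripts and does not send to the server. The sketch below uses the standard library's `urllib.parse.urlsplit` to show that the page number sits entirely in the fragment, so every request asks the server for the same resource. The helper name `request_target` is hypothetical, and the URLs are shortened versions of the one in the question.

```python
from urllib.parse import urlsplit

BASE = "http://www.forbes.com/global2000/list/"


def request_target(url):
    """Return (host, path-with-query), i.e. what the server sees without the fragment."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return parts.netloc, target


# Three "different" pages that differ only in the fragment
urls = [BASE + "#page:" + str(page) + "_sort:0_direction:asc" for page in (1, 2, 3)]
targets = {request_target(u) for u in urls}
print(targets)  # a single entry: all three URLs hit the same server resource
```

Because the set contains only one entry, looping over page numbers in the fragment cannot change what the server returns; the table is filled in afterwards by JavaScript.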
