Question

I can scrape the first page of this site:

http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/1/10

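But I am trying to scrape all the other pages on the site by using the "Next" button in the site's pagination.

I clicked the Next button and can see that the parameter that changes goes from 0/1/10 for the first page to 0/2/10, and so on.

I looked at the pagination code and can see that the pagination is in a div: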

 <div id="pagingNext" class="link-wrapper">

The problem is that I have only successfully scraped pagination from another site, using the following code:

button_next = soup.find("a", {"class": "btn paging-next"}, href=True)
while button_next:
    time.sleep(2)#delay time requests are sent so we don't get kicked by server
    soup=makesoup(url = "https://www.propertypal.com{0}".format(button_next["href"]))

This worked, but because the site I am scraping now does not seem to provide an href for the Next button's URL, I am lost as to how to scrape it.

I tried:

button_next = soup.find("div", {"class": "paging-Next"})
while button_next:
    time.sleep(2)#delay time requests are sent so we don't get kicked by server
    soup=makesoup(url = "https://www.propertypal.com{0}".format(button_next))

But it does not seem to scrape the other pages; it only ever gets the first page.

If anyone can help, I would be very grateful.

Thanks

Answer 1

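Workaround:

While you are checking whether the Next button is truthy, you can build the links manually and open them in a loop by incrementing the number at the tail of the URL, as you wrote: from 0/1/10 to 0/2/10 for page 2, and so on.

Something like this: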

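import urllib.request
from bs4 import BeautifulSoup

base_url = 'http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/'  # trailing 1/10 removed

incr = 0
button_next = True  # enter the loop for the first page
while button_next:
    incr += 1
    next_url = base_url + str(incr) + '/10'
    page = urllib.request.urlopen(next_url)
    soup = BeautifulSoup(page, "html.parser")
    # ... scrape the page here ...
    # Assumption (not verified): the pagingNext div has no link on the last page
    wrapper = soup.find("div", {"id": "pagingNext"})
    button_next = wrapper is not None and wrapper.find("a") is not None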
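
The URL pattern described above can also be wrapped in a small helper, which keeps the string building in one place. This is only a sketch: `page_url` and `BASE_URL` are names made up for illustration, and it assumes the last two path segments are the 1-based page number and the page size, as the URLs in the question suggest.

```python
# Base of the search URL from the question, with the trailing "1/10" removed.
BASE_URL = "http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/"


def page_url(page, per_page=10):
    """Build the search URL for a 1-based page number (hypothetical helper)."""
    return "{}{:d}/{:d}".format(BASE_URL, page, per_page)


# The first three page URLs, ending in 1/10, 2/10 and 3/10.
first_three = [page_url(n) for n in range(1, 4)]
for url in first_three:
    print(url)
```

Each URL can then be passed to `urllib.request.urlopen` or `requests.get` inside the loop.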

Answer 2

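Since you already know how the URL changes from page to page, there is no need to check the button_next URL. So, instead of using the URL "http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/1/10", I suggest using "http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/1/50". The site offers this option to show 50 items at a time, so instead of iterating over 4044 pages you only go through 809.

In the while loop, we wait until current reaches 810, at which point we know the last page has been scraped, because by inspection /809/50 is the last page.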

import requests
from bs4 import BeautifulSoup

current = 1  # pages are numbered from 1 (the first page is /1/50)
while current < 810:  # Last page, by inspection, is /809/50
    url = "http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/{:d}/50".format(current)
    data = requests.get(url).text
    soup = BeautifulSoup(data, "html.parser")
    print(url)
    #  Do your scraping here
    current += 1
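
As a sanity check on the page counts, the 809 figure is consistent with the 4044 figure. The thread does not give the exact number of results, so this sketch only uses the upper bound implied by 4044 pages of 10 items; `pages_needed` is a hypothetical helper name.

```python
def pages_needed(total_items, per_page):
    """Number of pages needed to show total_items (integer ceiling division)."""
    return (total_items + per_page - 1) // per_page


# 4044 pages of 10 items means at most 4044 * 10 results
# and at least 4043 * 10 + 1 (the last page holds at least one item).
max_items = 4044 * 10
min_items = 4043 * 10 + 1

print(pages_needed(max_items, 50))  # 809
print(pages_needed(min_items, 50))  # 809
```

Either bound gives 809 pages at 50 items per page, which matches the last page found by inspection.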

Answer 3

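In this case, this is the best way to exhaust all the pages without even knowing how many pages the results are spread across, as t.m.adam already mentioned earlier. Give it a try. It will give you all the names.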
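
A minimal sketch of that idea, not the answer's original code: keep requesting pages until the current page no longer offers a Next link. The names `scrape_all_pages` and `has_next_link` are made up here, and the check assumes (unverified) that the `pagingNext` div contains an `<a>` only when another page exists.

```python
import time


def has_next_link(soup):
    """Return True if the page seems to offer a Next link.

    Assumption: <div id="pagingNext"> holds an <a> only when a next page exists.
    """
    wrapper = soup.find("div", {"id": "pagingNext"})
    return wrapper is not None and wrapper.find("a") is not None


def scrape_all_pages(fetch_soup, has_next, delay=2.0, max_pages=10000):
    """Yield (page_number, soup) for page 1, 2, ... until has_next(soup) is false.

    fetch_soup is any callable that maps a 1-based page number to a parsed page.
    max_pages is a safety cap in case the stop condition never triggers.
    """
    page = 1
    while page <= max_pages:
        soup = fetch_soup(page)
        yield page, soup
        if not has_next(soup):
            break
        page += 1
        time.sleep(delay)  # be polite to the server between requests
```

Here `fetch_soup` could be a small function that calls `requests.get` on the page URL and wraps the response text in `BeautifulSoup(..., "html.parser")`; the names can then be extracted from each yielded soup.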


(Python 3, BeautifulSoup 4) - Scraping Pagination in a Div

3 answers: