Python - Display results from all pages, not just the first (Beautiful Soup)

Asked: 2017-02-15 13:58:08

Tags: python python-3.x beautifulsoup screen-scraping

I've been using Beautiful Soup to build a simple scraper that retrieves food hygiene ratings for restaurants based on a postcode entered by the user. The code works and correctly fetches results from the URL.

What I need help with is how to display all of the results, not just the results on the first page.

My code is as follows:

import requests
from bs4 import BeautifulSoup

pc = input("Please enter postcode")

url = "https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode="+pc+"&distance=1&search.x=8&search.y=6&gbt_id=0&award_score=&award_range=gt"
r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")
g_data = soup.findAll("div", {"class": "search-result"})

for item in g_data:
    print(item.find_all("a", {"class": "name"})[0].text)
    try:
        print(item.find_all("span", {"class": "address"})[0].text)
    except IndexError:
        pass
    try:
        print(item.find_all("div", {"class": "rating-image"})[0].text)
    except IndexError:
        pass

Looking at the URL, I noticed that the page being displayed depends on a variable called page in the URL's query string:

https://www.scoresonthedoors.org.uk/search.php?award_sort=ALPHA&name=&address=BT147AL&x=0&y=0&page=2#results

The pagination markup for the "Next" button is:

<a style="float: right" href="?award_sort=ALPHA&amp;name=&amp;address=BT147AL&amp;x=0&amp;y=0&amp;page=3#results" rel="next " title="Go forward one page">Next <i class="fa fa-arrow-right fa-3"></i></a>
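As an aside, the page number can be pulled out of a link like that with the standard library's URL parsing tools. A minimal sketch, using just the href value from the anchor above (entities already decoded):

```python
from urllib.parse import urlparse, parse_qs

# href taken from the "Next" link markup above
href = "?award_sort=ALPHA&name=&address=BT147AL&x=0&y=0&page=3#results"

# urlparse splits off the "#results" fragment; parse_qs maps each
# query parameter to a list of values (empty values are dropped)
query = parse_qs(urlparse(href).query)
next_page = int(query["page"][0])
print(next_page)  # 3
```

This only tells you the next page, not the total, but the same parsing applies to the last page link if the paginator exposes one.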

Is there a way I could have my code find out how many pages of results there are and then fetch the results from each page?

Would the best solution be to have the code change "page=" in the URL string each time (for example with a for loop), or is there a way to work it out from the information in the pagination link markup?

Many thanks to anyone who helps with or looks at this question.

1 answer:

Answer 0 (score: 1)

You're actually going about it the right way. Generating the paginated URLs up front before scraping is a good approach.

I've actually written almost the whole thing for you. The part to look at is the find_max_page() function, which starts by extracting the maximum page number from the pagination string. With that number, you can generate all the URLs you need and scrape them one by one.

Check the code below; it's almost all there.

import requests
from bs4 import BeautifulSoup


class RestaurantScraper(object):

    def __init__(self, pc):
        self.pc = pc                                # the postcode entered by the user
        self.max_page = self.find_max_page()        # the number of result pages available
        self.restaurants = list()                   # the final list of scraped restaurants

    def run(self):
        for url in self.generate_pages_to_scrape():
            restaurants_from_url = self.scrape_page(url)
            self.restaurants += restaurants_from_url     # add this page's restaurants to the global list

    def create_url(self):
        """
        Create a core url to scrape
        :return: A url without pagination (= page 1)
        """
        return "https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + self.pc + \
               "&distance=1&search.x=8&search.y=6&gbt_id=0&award_score=&award_range=gt"

    def create_paginated_url(self, page_number):
        """
        Create a paginated url
        :param page_number: pagination (integer)
        :return: A paginated URL
        """
        return self.create_url() + "&page={}".format(str(page_number))

    def find_max_page(self):
        """
        Function to find the number of pages for a specific search.
        :return: The number of pages (integer)
        """
        r = requests.get(self.create_url())
        soup = BeautifulSoup(r.content, "lxml")
        pagination_soup = soup.findAll("div", {"id": "paginator"})
        pagination = pagination_soup[0]
        page_text = pagination("p")[0].text
        return int(page_text.replace('Page 1 of ', ''))

    def generate_pages_to_scrape(self):
        """
        Generate all the paginated url using the max_page attribute previously scraped.
        :return: List of urls
        """
        return [self.create_paginated_url(page_number) for page_number in range(1, self.max_page + 1)]

    def scrape_page(self, url):
        """
        This came from your original code snippet. It probably needs a bit of work, but you get the idea.
        :param url: Url to scrape and get data from.
        :return:
        """
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "lxml")
        g_data = soup.findAll("div", {"class": "search-result"})

        restaurants = list()
        for item in g_data:
            name = item.find_all("a", {"class": "name"})[0].text
            restaurants.append(name)
            try:
                print(item.find_all("span", {"class": "address"})[0].text)
            except IndexError:
                pass
            try:
                print(item.find_all("div", {"class": "rating-image"})[0].text)
            except IndexError:
                pass
        return restaurants


if __name__ == '__main__':
    pc = input('Give your post code')
    scraper = RestaurantScraper(pc)
    scraper.run()
    print("{} restaurants scraped".format(len(scraper.restaurants)))
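To answer the second part of your question: instead of computing max_page up front, you could also keep following the rel="next" link until it disappears. A minimal sketch of that idea (the fetch callable and the fake pages below are made up for illustration, so it can run offline; in real use you would pass a thin wrapper around requests.get and urljoin relative hrefs against the base URL):

```python
from bs4 import BeautifulSoup


def follow_next_links(fetch, first_url):
    """Yield the HTML of every result page by following rel="next" links.

    `fetch` is any callable mapping a URL to an HTML string; injecting it
    keeps this testable without hitting the live site.
    """
    url = first_url
    while url:
        html = fetch(url)
        yield html
        soup = BeautifulSoup(html, "html.parser")
        link = soup.find("a", rel="next")        # the "Next" pagination anchor
        url = link["href"] if link else None     # stop when there is no Next link


# offline demonstration with two fake pages
pages = {
    "page1": '<div class="search-result">A</div><a href="page2" rel="next">Next</a>',
    "page2": '<div class="search-result">B</div>',
}
visited = list(follow_next_links(pages.get, "page1"))
print(len(visited))  # 2
```

The trade-off: this needs no paginator parsing and works even if the page count changes mid-scrape, but the requests are strictly sequential, whereas pre-generated URLs could be fetched concurrently.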