Can't exhaust my scraper

Date: 2018-05-30 18:42:10

Tags: python python-3.x web-scraping beautifulsoup

I've written a scraper in Python using the BeautifulSoup library to parse all the names across the different pages of a website. I could have handled it if the URLs weren't paginated differently: some URLs have no pagination at all because they contain very little content.

My question is: how can I combine this into one function that handles the URLs whether they have pagination or not?

My initial attempt (it can only parse content from the first page of each URL):

import requests 
from bs4 import BeautifulSoup

urls = {
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
    'https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all',
    'https://www.mobilehome.net/mobile-home-park-directory/vermont/all'
}

def get_names(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("td[class='table-row-price']"):
        name = items.select_one("h2 a").text
        print(name)

if __name__ == '__main__':
    for url in urls:
        get_names(url)

I could have managed the whole thing if every URL had pagination, like the one below:

from bs4 import BeautifulSoup 
import requests

page_no = 0
page_link = "https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all/page/{}"

while True:
    page_no+=1
    res = requests.get(page_link.format(page_no))
    soup = BeautifulSoup(res.text,'lxml')
    container = soup.select("td[class='table-row-price']")
    if len(container)<=1:break 

    for content in container:
        title = content.select_one("h2 a").text
        print(title)

However, not all of the URLs have pagination. So how can I grab all of the names whether a URL has pagination or not?

2 Answers:

Answer 0 (score: 2)

This solution tries to find the pagination <a> tags. If any pagination is found, all of the pages are scraped when the user iterates over an instance of the PageScraper class. If not, only the first result (a single page) is scraped:

import contextlib

import requests
from bs4 import BeautifulSoup as soup


def has_pagination(f):
    # Only run the decorated method when pagination links were actually found.
    def wrapper(self):
        if not self._pages:
            raise ValueError('No pagination found')
        return f(self)
    return wrapper


class PageScraper:
    def __init__(self, url: str):
        self.url = url
        self._home_page = requests.get(self.url).text
        # Grab the page numbers from the pagination links, dropping the trailing
        # "next" link; an empty list means the page is not paginated.
        pagination = soup(self._home_page, 'html.parser').find('div', {'class': 'pagination'})
        self._pages = [a.text for a in pagination.find_all('a')][:-1] if pagination else []

    @property
    def first_page(self):
        return [i.find('h2', {'class': 'table-row-heading'}).text
                for i in soup(self._home_page, 'html.parser').find_all('td', {'class': 'table-row-price'})]

    @has_pagination
    def __iter__(self):
        for p in self._pages:
            _link = requests.get(f'{self.url}/page/{p}').text
            yield [i.find('h2', {'class': 'table-row-heading'}).text
                   for i in soup(_link, 'html.parser').find_all('td', {'class': 'table-row-price'})]

    @classmethod
    @contextlib.contextmanager
    def feed_link(cls, link):
        # A context manager may only yield once, so gather everything first:
        # the first page, plus every paginated page when pagination exists.
        scraper = cls(link)
        results = list(scraper.first_page)
        try:
            for page in scraper:
                results.extend(page)
        except ValueError:
            pass  # no pagination: keep only the first page's results
        yield results

The class's constructor finds any pagination, and the __iter__ method fetches all the pages only if pagination links were found. For example, https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all has no pagination. Therefore:

r = PageScraper('https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all')
pages = [i for i in r]

ValueError: No pagination found

The content of the first page can still be retrieved, however:

print(r.first_page)
['Forest Park MHP', 'Gansett Mobile Home Park', 'Meadowlark Park', 'Indian Cedar Mobile Homes Inc', 'Sherwood Valley Adult Mobile', 'Tripp Mobile Home Park', 'Ramblewood Estates', 'Countryside Trailer Park', 'Village At Wordens Pond', 'Greenwich West Inc', 'Dadson Mobile Home Estates', "Oliveira's Garage", 'Tuckertown Village Clubhouse', 'Westwood Estates']

For a URL with full pagination, though, all of the resulting pages can be scraped:

r = PageScraper('https://www.mobilehome.net/mobile-home-park-directory/maine/all')
d = [i for i in r]

PageScraper.feed_link performs this check automatically: it outputs the first page along with all subsequent results should pagination be found, or only the first page if there is no pagination:

urls = {
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
    'https://www.mobilehome.net/mobile-home-park-directory/vermont/all',
    'https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all'
}

for url in urls:
    with PageScraper.feed_link(url) as r:
        print(r)
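
If a single flat list of names across all the states is wanted, the same context manager can be reused. A small usage sketch, assuming feed_link yields one flat list of names as in the version above:

all_names = []
for url in urls:
    with PageScraper.feed_link(url) as names:
        all_names.extend(names)

print(len(all_names), 'names collected in total')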

Answer 1 (score: 2)

I seem to have found a very robust solution to this problem. The approach is iterative. It first checks whether a next-page URL is available on the current page. If it finds one, it follows that URL and repeats the process. If a link has no pagination, the scraper simply breaks out and tries another link.

Here it is:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

urls = [
        'https://www.mobilehome.net/mobile-home-park-directory/alaska/all',
        'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
        'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
        'https://www.mobilehome.net/mobile-home-park-directory/vermont/all'
    ]

def get_names(link):
    while True:
        r = requests.get(link)
        soup = BeautifulSoup(r.text,"lxml")
        for items in soup.select("td[class='table-row-price']"):
            name = items.select_one("h2 a").text
            print(name)

        nextpage = soup.select_one(".pagination a.next_page")

        if not nextpage:break  #If no pagination url is there, it will break and try another link

        link = urljoin(link,nextpage.get("href"))

if __name__ == '__main__':
    for url in urls:
        get_names(url)
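
For completeness, a small variation on the same next-page-following idea (a sketch, assuming the same selectors as above) that returns the names as a list instead of printing them and reuses a single requests.Session across requests:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def collect_names(link, session=None):
    """Follow 'next page' links (if any) and return every scraped name as a list."""
    session = session or requests.Session()  # reuse one connection across pages
    names = []
    while link:
        soup = BeautifulSoup(session.get(link).text, "lxml")
        names.extend(item.select_one("h2 a").text
                     for item in soup.select("td[class='table-row-price']"))
        nextpage = soup.select_one(".pagination a.next_page")
        # No "next page" link means no pagination, or this was the last page.
        link = urljoin(link, nextpage.get("href")) if nextpage else None
    return names

names = collect_names('https://www.mobilehome.net/mobile-home-park-directory/maine/all')
print(len(names), names[:5])

Reusing a session keeps the underlying HTTP connection alive between page requests, which helps when a state directory spans many pages.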