How to scrape all pages without knowing how many pages there are

Date: 2020-06-23 21:26:12

Tags: python web-scraping

I have the following function to collect all of the prices, but I'm having trouble scraping the total number of pages. How can I scrape every page without knowing how many pages there are?

import requests
from bs4 import BeautifulSoup
import pandas as pd
import itertools

def get_data(page):
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page=' + page
    response = requests.get(url)
    # BeautifulSoup needs the response body, not the Response object itself
    soup = BeautifulSoup(response.text, 'html.parser')
    price = soup.find_all('h3', {'class': 'price'})
    price = pd.DataFrame([p.text for p in price]).rename(columns={0: 'Price'})
    return price

Here is what I tried, but it doesn't seem to work:

for pages in itertools.count(start=1):
    try:
        table = get_data('1').append(table)
    except Exception:
        break

3 Answers:

Answer 0 (score: 2)

This is a good opportunity for recursion, as long as you don't expect more than 1000 pages, since I think Python only allows a maximum recursion depth of about 1000.
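
You can check or raise that limit yourself if needed; a quick sketch, assuming CPython's default of 1000:

import sys

print(sys.getrecursionlimit())  # 1000 by default on CPython
sys.setrecursionlimit(2000)     # raise it if you genuinely need deeper recursion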

So the get_prices function below is first called with its default parameters. It then keeps calling itself, appending each page's prices to the prices list, until a page request no longer returns status code 200, or until the maximum recursion depth you specify is reached:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_prices(page=1, prices=None, depth=0, max_depth=100):
    # avoid a mutable default argument so repeated calls start with a fresh list
    if prices is None:
        prices = []

    if depth >= max_depth:
        return prices

    url = 'https://www.remax.ca/bc/vancouver--real-estate?page={page}'.format(page=page)

    r = requests.get(url)
    if r.status_code != 200:  # no such page: we've gone past the end
        return prices

    soup = BeautifulSoup(r.text, 'html.parser')
    price = soup.find_all('h3', {'class': 'price'})
    price = pd.DataFrame([p.text for p in price]).rename(columns={0: 'Price'})

    prices.append(price)

    # recurse into the next page, counting depth towards max_depth
    return get_prices(page=page + 1, prices=prices, depth=depth + 1)

prices = get_prices()  # a list of per-page DataFrames; combine with pd.concat(prices)
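
Alternatively, if you don't like recursion, or you need to fetch more than 1000 pages in one go, you can use a simpler but less interesting while loop. A minimal sketch of that loop, reusing the same URL and parsing as above and stopping at the first non-200 response:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_prices():
    prices = []
    page = 1
    while True:
        url = 'https://www.remax.ca/bc/vancouver--real-estate?page={page}'.format(page=page)
        r = requests.get(url)
        if r.status_code != 200:  # no more pages
            break
        soup = BeautifulSoup(r.text, 'html.parser')
        tags = soup.find_all('h3', {'class': 'price'})
        prices.append(pd.DataFrame([p.text for p in tags]).rename(columns={0: 'Price'}))
        page += 1
    return prices

prices = get_prices()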

Answer 1 (score: 0)

Try this:

from urllib.request import urlopen  # urlopen was used below but never imported
from bs4 import BeautifulSoup
import pandas as pd
import itertools

def get_data(price, page):
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page=' + page
    html = urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup.find_all('h3', {'class': 'price'})
    # store this page's prices in the shared dict, keyed by page number
    price[page] = pd.DataFrame([p.text for p in tags]).rename(columns={0: 'Price'})

price = dict()
for page in itertools.count(start=1):
    try:
        get_data(price, str(page))
    except Exception:
        break
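
Note that this loop only stops when a request raises an exception. If the site answers every page number with status 200 and an empty listing, it will never break; a safer variant (a sketch, assuming get_data fills the dict as above) also stops at the first empty page:

for page in itertools.count(start=1):
    try:
        get_data(price, str(page))
    except Exception:
        break
    if price[str(page)].empty:  # page exists but has no listings: stop
        break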

Answer 2 (score: -1)

Maybe you should change "get_data('1')" to "get_data(str(pages))"?
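
For instance, the corrected loop might look like this (a sketch: table must exist before the first combine, pd.concat is used since DataFrame.append is deprecated in newer pandas, and the break still assumes pages past the end return no prices or raise an error):

import itertools
import pandas as pd

table = pd.DataFrame()
for pages in itertools.count(start=1):
    try:
        new = get_data(str(pages))
        if new.empty:  # a page with no prices means we've passed the end
            break
        table = pd.concat([new, table])
    except Exception:
        break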