I have the following function to collect all the prices, but I'm having trouble scraping the total number of pages. How can I scrape all the pages without knowing how many there are?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import itertools

def get_data(page):
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page=' + page
    r = requests.get(url)
    # parse the response body, not the Response object itself
    soup = BeautifulSoup(r.text, 'html.parser')
    price = soup.find_all('h3', {'class': 'price'})
    price = pd.DataFrame([p.text for p in price]).rename(columns={0: 'Price'})
    return price
I tried the following, but it doesn't seem to work:
for pages in itertools.count(start=1):
    try:
        table = get_data('1').append(table)
    except Exception:
        break
Answer 0 (score: 2)
This is a good opportunity for recursion, as long as you don't expect more than 1000 pages, since Python's default maximum recursion depth is 1000:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_prices(page=1, prices=None, depth=0, max_depth=100):
    # build a fresh list on the first call (a mutable default would be
    # shared across separate top-level calls)
    if prices is None:
        prices = []
    if depth >= max_depth:
        return prices
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page={page}'.format(page=page)
    r = requests.get(url)
    if r.status_code != 200:
        # ran past the last page (or the site errored out)
        return prices
    soup = BeautifulSoup(r.text, 'html.parser')
    price = soup.find_all('h3', {'class': 'price'})
    price = pd.DataFrame([p.text for p in price]).rename(columns={0: 'Price'})
    prices.append(price)
    return get_prices(page=page+1, prices=prices, depth=depth+1)

prices = get_prices()
So get_prices is first called with its default arguments. It then keeps calling itself, appending each page's prices to the prices list, until it reaches a page that doesn't return status code 200, or hits the maximum recursion depth you specify.
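Note that the result is a list of one-column DataFrames, one per page; pd.concat(prices, ignore_index=True) stacks them into a single frame.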
Alternatively, if you don't like recursion, or you need to scrape more than 1000 pages in one go, you can use a simpler but less interesting while loop:
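A minimal sketch of that version (untested; the function name and the empty-page check are illustrative additions, not part of the recursive code above):

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_prices_loop(max_pages=10000):
    prices = []
    page = 1
    while page <= max_pages:
        url = 'https://www.remax.ca/bc/vancouver--real-estate?page={page}'.format(page=page)
        r = requests.get(url)
        if r.status_code != 200:
            break  # ran past the last page (or the site errored out)
        soup = BeautifulSoup(r.text, 'html.parser')
        tags = soup.find_all('h3', {'class': 'price'})
        if not tags:
            break  # some sites return 200 with an empty listing page instead of a 404
        prices.append(pd.DataFrame([t.text for t in tags]).rename(columns={0: 'Price'}))
        page += 1
    return prices

prices = get_prices_loop()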
Answer 1 (score: 0)
Try this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import itertools

def get_data(prices, page):
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page=' + page
    html = urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    price = soup.find_all('h3', {'class': 'price'})
    # store this page's prices in the dict passed in by the caller
    # (rebinding a local variable, as the original did, would discard them)
    prices[page] = pd.DataFrame([p.text for p in price]).rename(columns={0: 'Price'})

prices = dict()
for page in itertools.count(start=1):
    try:
        get_data(prices, str(page))
    except Exception:
        break
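Assuming the site returns a 404 once you run past the last page, this terminates because urlopen raises an HTTPError for non-200 responses, which the except clause catches; if you swap in requests.get, you would need r.raise_for_status() or an explicit status check to get the same effect.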
Answer 2 (score: -1)
Maybe you should change get_data('1') to get_data(str(page))?
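That is, something along these lines (illustrative sketch; table also has to exist before the first iteration, and pd.concat stands in for the now-deprecated DataFrame.append):

import itertools
import pandas as pd

table = pd.DataFrame()
for page in itertools.count(start=1):
    try:
        # was: get_data('1'), which fetched page 1 forever
        table = pd.concat([table, get_data(str(page))], ignore_index=True)
    except Exception:
        break
# caveat: requests does not raise on a 404, so get_data needs something like
# r.raise_for_status() for this break to ever fire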