Looping through result pages when scraping with Python (e.g. a Google-style search)

Asked: 2020-01-24 16:47:39

Tags: python web-scraping requests

I am trying to scrape a real-estate website. The scraping code itself works well, but I have a problem:

When I run a search on this site, the results are spread across many pages (the layout looks like a Google search).

How can I work out how many result pages the search returned, so that I can scrape all of them?

The site is: https://www.zapimoveis.com.br/aluguel/predio-inteiro/?transacao=Aluguel&tipoUnidade=Comercial,Pr%C3%A9dio%20Inteiro&tipo=Im%C3%B3vel%20usado

As you can see, when I run a simple search, it shows at the top:

"10.177 prédios inteiros para alugar", which means "10,177 whole buildings for rent".

The bottom of the page lists the result pages that were found, and I want to scrape all of them.
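Since the page header already states the total number of results, one way to get the page count is to parse that number and divide by the number of listings shown per page. A minimal sketch (the `total_pages` helper and the 24-listings-per-page default are my assumptions, not something taken from the site):

```python
import math
import re

def total_pages(header_text, results_per_page=24):
    # Extract the leading number from e.g. "10.177 prédios inteiros para alugar".
    # Brazilian formatting uses "." as the thousands separator.
    match = re.search(r"([\d.]+)", header_text)
    total = int(match.group(1).replace(".", ""))
    return math.ceil(total / results_per_page)

print(total_pages("10.177 prédios inteiros para alugar"))  # 425, assuming 24 listings/page
```

Check how many listings one real results page actually contains and pass that as `results_per_page`.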

Here is my code. I need it to scrape every result page of the search in the same way. It already works for scraping data such as the rent, square metres, and so on:

```python
import pandas as pd
from bs4 import BeautifulSoup
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}

# requests.get returns a Response object, so call it `response` rather than `url`
response = requests.get("https://www.zapimoveis.com.br/aluguel/predio-inteiro/?transacao=Aluguel&tipoUnidade=Comercial,Pr%C3%A9dio%20Inteiro&tipo=Im%C3%B3vel%20usado", headers=headers)
response.raise_for_status()  # stop here on a non-200 status instead of parsing an error page

soup = BeautifulSoup(response.content, "html.parser")

# Rent values, e.g. "R$ 12.000/mês" -> 12000
Aluguel = [headline.get_text() for headline in soup.find_all("p", {"class": "simple-card__price"})]
AluguelFixed = [int(i.replace('.', '').replace("R$", "").replace("mês", "").replace("\n", "").replace("/", "").strip()) for i in Aluguel]

# Floor areas, e.g. "450 m²" -> 450
Metragem = [li.find("span", recursive=False).get_text() for li in soup.find_all("li", {"class": "feature__item"})]
MetragemAjustada = [int(i.replace('m²', '').strip()) for i in Metragem if 'm²' in i]

BancoDeDados = pd.DataFrame(data={"col1": MetragemAjustada, "col2": AluguelFixed})
BancoDeDados.to_csv("C:\\Users\\fernando.rezende\\OneDrive - ES Ltda\\Área de Trabalho/RobozapDataFrame.csv", sep=',', index=False)
```
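To visit every result page, the usual approach is to add a page-number query parameter to the search URL and fetch each page in a loop. A sketch using only the standard library to build those URLs; note that the `pagina` parameter name and the `page_url` helper are assumptions on my part — click page 2 on the site and inspect the address bar to confirm what the real parameter is called:

```python
from urllib.parse import urlencode

BASE = "https://www.zapimoveis.com.br/aluguel/predio-inteiro/"

def page_url(page, transacao="Aluguel"):
    # NOTE: `pagina` is a guess at the site's page parameter -- verify it
    # by navigating to page 2 in a browser and checking the URL.
    query = {"transacao": transacao, "pagina": page}
    return f"{BASE}?{urlencode(query)}"

# Build the URLs for the first few pages:
urls = [page_url(p) for p in range(1, 4)]
for u in urls:
    print(u)
```

Each URL can then be fetched with `requests.get(url, headers=headers)` and parsed with the same BeautifulSoup code as above; stop the loop once a page returns no `simple-card__price` elements.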


0 Answers:

No answers yet