Web scraping does not fetch all the data

Time: 2021-03-08 12:11:18

Tags: python web-scraping

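I have tried several ways to access Orpi's website, and every program I wrote and every request returns only the data available in the static HTML: the navigation bar and some useless information. I am trying to get information about any of the housing listings, but the part of the page that contains the listing information is never fetched.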

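This is the page I am trying to get data from: link

I am trying to get anything contained in

I have tried the following libraries, but none of them fetched anything: scrapy, beautifulsoup, request from nodejs, and requests from python.

Here is some of the code I tried: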

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.orpi.com/recherche/buy?transaction=buy&resultUrl=&agency=&minSurface=&maxSurface=&newBuild=&oldBuild=&minPrice=&maxPrice=&sort=date-down&layoutType=mixte&nbBedrooms=&page=&minLotSurface=&maxLotSurface=&minStoryLocation=&maxStoryLocation=')

soup = BeautifulSoup(source.text, 'lxml')

print(soup.prettify())

# Earlier attempts, left commented out:
#
# stories = []
# for a in soup.find_all('div', attrs={'class': 'u-mt-md'}):
#     stories.append([a])
# print(stories[0])
#
# article = soup.find('div', attrs={'class': 'u-mt-md'})
# one_article = article.find('a', class_='u-link-unstyled c-overlay__link').text
# html = article.prettify()
# print(article)

Using Scrapy:

import scrapy

class SpiderSpider(scrapy.Spider):
    name = 'Orpi'
    start_urls = ['https://www.orpi.com/recherche/buy?transaction=buy&resultUrl=&agency=&minSurface=&maxSurface=&newBuild=&oldBuild=&minPrice=&maxPrice=&sort=date-down&layoutType=mixte&nbBedrooms=&page=&minLotSurface=&maxLotSurface=&minStoryLocation=&maxStoryLocation=']

    def parse(self, response):
        # An element carrying several classes is matched by chaining the
        # classes with dots; a space would select a descendant tag instead.
        products = response.css('div.o-grid__col.o-grid__col--8')
        for product in products:
            for p in product.css('div.o-grid__col.u-flex.u-flex-column'):
                yield {
                    'Images': p.css('img.c-overlay__zoom.u-cover::attr(src)').getall(),
                }

1 answer:

Answer 0 (score: 1)

from selenium import webdriver
from time import sleep

driver = webdriver.Firefox() # Or Chrome()
driver.get("https://www.orpi.com/recherche/rent?transaction=rent&resultUrl=&realEstateTypes%5B0%5D=maison&realEstateTypes%5B1%5D=appartement&realEstateTypes%5B2%5D=terrain&realEstateTypes%5B3%5D=immeuble&realEstateTypes%5B4%5D=stationnement&agency=&minSurface=&maxSurface=&newBuild=&oldBuild=&minPrice=&maxPrice=&sort=date-down&layoutType=mixte&nbBedrooms=&page=&minLotSurface=&maxLotSurface=&minStoryLocation=&maxStoryLocation=")

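sleep(3)  # Give the page's JavaScript time to render the listings
html = driver.page_source  # HTML after rendering, not the raw server response
driver.quit()
# Do your stuff with html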
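
As one possible way to "do your stuff" with the rendered html, here is a minimal sketch that pulls image URLs out of it using only the standard library. The class name c-overlay__zoom is taken from the question's selectors. The sample_html string and the example.com URLs are made up for illustration and stand in for driver.page_source; the real page's markup may differ.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src of every <img> tag that carries a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # The class attribute is a space-separated list of class names.
        classes = (attributes.get("class") or "").split()
        if self.wanted_class in classes and attributes.get("src"):
            self.sources.append(attributes["src"])


def extract_image_sources(html, wanted_class="c-overlay__zoom"):
    """Return the src values of all <img> tags with wanted_class."""
    parser = ImageSrcCollector(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.sources


# Hypothetical stand-in for driver.page_source (not the real Orpi markup).
sample_html = """
<div class="o-grid__col o-grid__col--8">
  <img class="c-overlay__zoom u-cover" src="https://example.com/house-1.jpg">
  <img class="logo" src="https://example.com/logo.png">
</div>
"""

print(extract_image_sources(sample_html))
```

BeautifulSoup, as used in the question, would work just as well on the same string; the standard-library parser is used here only to keep the sketch free of extra dependencies.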

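If you use a different browser, replace the driver accordingly.

The page's content is loaded by JavaScript, so you need a delay before reading the page source in order to get the results.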
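
A fixed sleep can be too short on a slow connection and wastefully long on a fast one. A possible alternative is to poll the page source until the content you expect shows up. The sketch below is an illustration, not part of Selenium: wait_for_content and FakePage are hypothetical names, and FakePage merely simulates a page that fills in on its third read so that the example runs without a browser.

```python
import time


def wait_for_content(get_html, marker, timeout=10.0, poll_interval=0.5):
    """Poll get_html() until marker appears in its result.

    Returns the HTML that contains the marker, or raises TimeoutError
    if the marker has not appeared once the timeout has elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found within {timeout} s")
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a browser: content appears on the 3rd read."""

    def __init__(self):
        self.reads = 0

    def source(self):
        self.reads += 1
        if self.reads < 3:
            return "<div id='app'></div>"
        return "<div id='app'><img class='c-overlay__zoom'></div>"


page = FakePage()
html = wait_for_content(page.source, "c-overlay__zoom",
                        timeout=5.0, poll_interval=0.01)
print(page.reads)
```

With a real driver, the call would look like wait_for_content(lambda: driver.page_source, "c-overlay__zoom"), assuming that class name actually appears in the rendered listings. Selenium also provides WebDriverWait in selenium.webdriver.support.ui for this kind of explicit wait.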