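I have tried several ways of accessing the Orpi website, and every program I wrote gets back only the data present in the static HTML: the navigation bar and some useless information. I am trying to get information about any housing listing, but the part of the page that contains the listings is never retrieved.
This is the page I am trying to get data from: link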
I am trying to get anything contained in
I tried the following libraries, but none of them retrieved anything: Scrapy, BeautifulSoup, request from Node.js, and requests from Python.
Here is some of the code I tried:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.orpi.com/recherche/buy?transaction=buy&resultUrl=&agency=&minSurface=&maxSurface=&newBuild=&oldBuild=&minPrice=&maxPrice=&sort=date-down&layoutType=mixte&nbBedrooms=&page=&minLotSurface=&maxLotSurface=&minStoryLocation=&maxStoryLocation=')
soup = BeautifulSoup(source.text, 'lxml')
print(soup.prettify())
#stories = []
#
#for a in soup.find_all('div', attrs={'class': 'u-mt-md'}):
# stories.append([a])
#
#print(stories[0])
#article = soup.find('div', attrs={'class':'u-mt-md'})
#one_article = article.find('a', class_='u-link-unstyled c-overlay__link').text
#html = article.prettify()
#print(article)
Using Scrapy:
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'Orpi'
    start_urls = ['https://www.orpi.com/recherche/buy?transaction=buy&resultUrl=&agency=&minSurface=&maxSurface=&newBuild=&oldBuild=&minPrice=&maxPrice=&sort=date-down&layoutType=mixte&nbBedrooms=&page=&minLotSurface=&maxLotSurface=&minStoryLocation=&maxStoryLocation=']

    def parse(self, response):
        # Several classes on one element are chained with dots, not spaces;
        # a space in a CSS selector means "descendant element".
        products = response.css('div.o-grid__col.o-grid__col--8')
        for product in products:
            for p in product.css('div.o-grid__col.u-flex.u-flex-column'):
                yield {
                    'Images': p.css('img.c-overlay__zoom.u-cover::attr(src)').getall(),
                }
Answer 0 (score: 1)
from selenium import webdriver
from time import sleep
driver = webdriver.Firefox() # Or Chrome()
driver.get("https://www.orpi.com/recherche/rent?transaction=rent&resultUrl=&realEstateTypes%5B0%5D=maison&realEstateTypes%5B1%5D=appartement&realEstateTypes%5B2%5D=terrain&realEstateTypes%5B3%5D=immeuble&realEstateTypes%5B4%5D=stationnement&agency=&minSurface=&maxSurface=&newBuild=&oldBuild=&minPrice=&maxPrice=&sort=date-down&layoutType=mixte&nbBedrooms=&page=&minLotSurface=&maxLotSurface=&minStoryLocation=&maxStoryLocation=")
sleep(3)
html = driver.page_source
driver.quit()
# Do your stuff with html
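If you use a different browser, replace the driver accordingly.
The page loads its content with JavaScript, so you need a delay before the results are available.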
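A fixed sleep(3) can be too short on a slow connection and wasteful on a fast one. As a rough sketch, you could poll the page source until the listing markup shows up instead. The helper names has_class and wait_for_class are hypothetical, and the class name c-overlay__link is only assumed from the question's code to appear in the rendered listings; it has not been verified against the live page.

```python
import time
from html.parser import HTMLParser


class _ClassFinder(HTMLParser):
    """Sets .found to True once any start tag carries the wanted CSS class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.found = False

    def handle_starttag(self, tag, attrs):
        # The class attribute is a space-separated list of class names.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted in classes:
            self.found = True


def has_class(html, css_class):
    """Return True if any element in `html` has `css_class` (hypothetical helper)."""
    finder = _ClassFinder(css_class)
    finder.feed(html)
    return finder.found


def wait_for_class(get_html, css_class, timeout=10.0, interval=0.5):
    """Call get_html() repeatedly until css_class appears, then return that HTML.

    Raises TimeoutError if the class has not appeared once `timeout` seconds
    have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if has_class(html, css_class):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"no element with class {css_class!r} after {timeout}s"
            )
        time.sleep(interval)
```

With the driver from the answer, this would replace sleep(3) and the page_source line with html = wait_for_class(lambda: driver.page_source, "c-overlay__link"). Running has_class on the text returned by requests.get is also a quick way to confirm that the listings are missing from the static HTML. Selenium ships its own WebDriverWait class for the same purpose, which is probably the better choice once the right element to wait for is known.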