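I'm trying to scrape listings from the Airbnb website that are rendered by JavaScript, so the usual requests + beautiful soup approach won't work.
Instead, from this thread:
How to "render" HTML with PyQt5's QWebEngineView
I borrowed the following code, which is supposed to render the page along with its JavaScript: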
import requests
import bs4
def render(source_html):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtCore import QEventLoop
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    class Render(QWebEngineView):
        def __init__(self, html):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.setHtml(html)
            while self.html is None:
                self.app.processEvents(QEventLoop.ExcludeUserInputEvents | QEventLoop.ExcludeSocketNotifiers | QEventLoop.WaitForMoreEvents)
            self.app.quit()

        def _callable(self, data):
            self.html = data

        def _loadFinished(self, result):
            self.page().toHtml(self._callable)

    return Render(source_html).html
url = 'https://www.airbnb.pl/s/Girona--Hiszpania/homes?refinement_paths%5B%5D=%2Fhomes&place_id=ChIJRRrTHsPNuhIRQMqjIeD6AAM&query=Girona%2C%20Hiszpania&checkin=2018-07-03&checkout=2018-07-20&allow_override%5B%5D=&s_tag=AtfIQ5_V'
sample_html = requests.get(url).text
rendered_page = render(sample_html)
However, when I print(rendered_page), the listings (for example the first listing, whose id attribute is "listing-1756555") are still missing.
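To make that check repeatable instead of reading through the printed HTML, here is a minimal sketch that looks for an element id in a string of markup. It uses only the standard library's html.parser rather than bs4 so that it stands alone; the names IdCollector and has_element_id are my own, hypothetical helpers and are not part of the borrowed code.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute value seen on a start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def has_element_id(html, element_id):
    """Return True if any tag in the given HTML string has id == element_id."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return element_id in collector.ids
```

With the code above, has_element_id(rendered_page, "listing-1756555") should return False as long as the listing is absent from the rendered output.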
Is there any way to fix this?
I'd rather not use Selenium, because it requires "installing" a webdriver, and this scraper is part of a Django web application.