I'm new to scraping dynamically loaded websites, and I've been struggling to scrape the team names and odds from this site:
https://www.cashpoint.com/de/fussball/deutschland/bundesliga
I tried the approach from this post:
PyQt4 to PyQt5 -> mainFrame() deprecated, need fix to load web pages
import sys
import bs4 as bs
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEnginePage

class Page(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self):
        self.html = self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()

def main():
    page = Page('https://www.cashpoint.com/de/fussball/deutschland/bundesliga')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    js_test = soup.find('div', class_='game__team game__team__football')
    print(js_test.text)

if __name__ == '__main__':
    main()
But it does not work for the site I want to scrape; I get an
AttributeError: 'NoneType' object has no attribute 'text'
error. Even though the post above describes a method written for dynamically loaded websites, it doesn't get me the site's content. From what I've read, the first step in handling a dynamically loaded site is to determine how the data is rendered on the page. How do I do that, and why doesn't PyQt5 work on this site? Selenium is not an option for me because it is too slow for fetching live odds. Is there a way to get this site's HTML content when I inspect it, so that it works with BeautifulSoup or Scrapy? Thanks in advance.
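To illustrate what I mean by checking how the data is rendered: the idea, as I understand it, is to compare what a plain HTTP client receives with what the browser's DOM looks like after the page's scripts have run. A toy sketch of that check (both HTML snippets are invented for illustration; the real page's markup differs):

```python
from bs4 import BeautifulSoup

# What a plain HTTP client (requests/Scrapy) receives from a JS-rendered
# page: an empty container that a script fills in later (invented snippet).
static_html = '<div id="app"></div><script src="bundle.js"></script>'

# What the browser's DOM holds after the script has run (invented snippet).
rendered_html = '<div id="app"><div class="game__team">1899 Hoffenheim</div></div>'

# The same selector succeeds only on the rendered DOM: if it finds nothing
# in the raw response but matches in the browser's inspector, the data is
# created client-side and plain BeautifulSoup/Scrapy will not see it.
print(BeautifulSoup(static_html, 'html.parser').find('div', class_='game__team'))
# -> None
print(BeautifulSoup(rendered_html, 'html.parser').find('div', class_='game__team').text)
# -> 1899 Hoffenheim
```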
Answer 0 (score: 1)
The provided code fails because the new elements (the divs with the classes "game__team" and "game__team__football") are created asynchronously, so when the loadFinished signal is emitted those elements have not been created yet, even though the page has finished loading.
One possible solution is to obtain the list of texts directly with JavaScript via the runJavaScript() method; if the list is empty, try again after an interval T until the list is no longer empty.
import sys

from PyQt5 import QtCore, QtWidgets, QtWebEngineWidgets


class Scrapper(QtCore.QObject):
    def __init__(self, interval=500, parent=None):
        super().__init__(parent)
        self._result = []
        self._interval = interval
        self.page = QtWebEngineWidgets.QWebEnginePage(self)
        self.page.loadFinished.connect(self.on_load_finished)
        self.page.load(
            QtCore.QUrl("https://www.cashpoint.com/de/fussball/deutschland/bundesliga")
        )

    @property
    def result(self):
        return self._result

    @property
    def interval(self):
        return self._interval

    @interval.setter
    def interval(self, interval):
        self._interval = interval

    @QtCore.pyqtSlot(bool)
    def on_load_finished(self, ok):
        if ok:
            self.execute_javascript()
        else:
            QtCore.QCoreApplication.exit(-1)

    def execute_javascript(self):
        self.page.runJavaScript(
            """
            function text_by_classname(classname){
                var texts = [];
                var elements = document.getElementsByClassName(classname);
                for (const e of elements) {
                    texts.push(e.textContent);
                }
                return texts;
            }
            [].concat(text_by_classname("game__team"), text_by_classname("game__team__football"));
            """,
            self.javascript_callback,
        )

    def javascript_callback(self, result):
        if result:
            self._result = result
            QtCore.QCoreApplication.quit()
        else:
            QtCore.QTimer.singleShot(self.interval, self.execute_javascript)


def main():
    app = QtWidgets.QApplication(sys.argv)
    scrapper = Scrapper(interval=1000)
    app.exec_()
    result = scrapper.result
    del scrapper, app
    print(result)


if __name__ == "__main__":
    main()
Output:
[' 1899 Hoffenheim ', ' FC Augsburg ', ' Bayern München ', ' Werder Bremen ', ' Hertha BSC ', ' SC Freiburg ', ' 1. Fsv Mainz 05 ', ' Borussia Dortmund ', ' 1. FC Köln ', ' Bayer 04 Leverkusen ', ' SC Paderborn ', ' FC Union Berlin ', ' Fortuna Düsseldorf ', ' RB Leipzig ', ' VFL Wolfsburg ', ' Borussia Mönchengladbach ', ' FC Schalke 04 ', ' Eintracht Frankfurt ', ' Werder Bremen ', ' 1. Fsv Mainz 05 ', ' Borussia Dortmund ', ' RB Leipzig ', ' FC Augsburg ', ' Fortuna Düsseldorf ', ' FC Union Berlin ', ' 1899 Hoffenheim ', ' Bayer 04 Leverkusen ', ' Hertha BSC ', ' Borussia Mönchengladbach ', ' SC Paderborn ', ' VFL Wolfsburg ', ' FC Schalke 04 ', ' Eintracht Frankfurt ', ' 1. FC Köln ', ' SC Freiburg ', ' Bayern München ', ' 1899 Hoffenheim ', ' Borussia Dortmund ', ' Bayern München ', ' VFL Wolfsburg ', ' 1899 Hoffenheim ', ' Bayern München ', ' Hertha BSC ', ' 1. Fsv Mainz 05 ', ' 1. FC Köln ', ' SC Paderborn ', ' Fortuna Düsseldorf ', ' VFL Wolfsburg ', ' FC Schalke 04 ', ' Werder Bremen ', ' Borussia Dortmund ', ' FC Augsburg ', ' FC Union Berlin ', ' Bayer 04 Leverkusen ', ' Borussia Mönchengladbach ', ' VFL Wolfsburg ', ' Eintracht Frankfurt ', ' SC Freiburg ', ' 1899 Hoffenheim ', ' Bayern München ']
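The raw list needs some tidying before use: entries are padded with whitespace, and the team names arrive as a flat sequence. A small post-processing sketch (the alternating home/away order is an assumption inferred from the output above, not something the page guarantees):

```python
def to_fixtures(names):
    """Strip padding whitespace and group consecutive entries into
    (home, away) pairs, assuming the list alternates home/away."""
    cleaned = [n.strip() for n in names]
    return list(zip(cleaned[0::2], cleaned[1::2]))

print(to_fixtures([' 1899 Hoffenheim ', ' FC Augsburg ',
                   ' Bayern München ', ' Werder Bremen ']))
# -> [('1899 Hoffenheim', 'FC Augsburg'), ('Bayern München', 'Werder Bremen')]
```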
Answer 1 (score: 0)
My suggestion to you is to use Selenium as a solution:
pip install selenium
from selenium import webdriver
from bs4 import BeautifulSoup as soup

# Target page from the question
URL = 'https://www.cashpoint.com/de/fussball/deutschland/bundesliga'

driver = webdriver.Firefox(executable_path='/Users/alireza/Downloads/geckodriver')
driver.get(URL)
driver.maximize_window()
page_source = driver.page_source
page_soup = soup(page_source, 'html.parser')
js_test = page_soup.find("div", {"class": "game__team game__team__football"})
print(js_test.text)
You can download geckodriver from here.
If you want to see example code, you can check here; it's a web scraper for www.tripadvisor.com. Hope this helps.