Question

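I am trying to render a JavaScript-driven web page into populated HTML so I can scrape it. Researching different solutions (Selenium, reverse-engineering the page, etc.) led me to this technique, but I can't get it to work. BTW, I am new to Python and mostly at the cut/paste/experiment stage. I got past the installation and indentation problems, but now I'm stuck.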

In the test code below, print(sample_html) works and returns the original HTML of the target page, but print(render(sample_html)) always returns the word 'None'.

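Interestingly, if you run it against amazon.com, they detect that it is not a real browser and return HTML containing a warning about automated access. The other test pages, however, return real HTML that should render, except that it doesn't.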

How do I troubleshoot the result always coming back as 'None'?

def render(source_html):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    class Render(QWebEngineView):
        def __init__(self, html):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.setHtml(html)
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self.callable)

        def callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()

            return Render(source_html).html

import requests
#url = 'http://webscraping.com'  
#url='http://www.amazon.com'
url='https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'
sample_html = requests.get(url).text
print(sample_html)
print(render(sample_html))
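
One thing visible in the listing as shown here: the `return Render(source_html).html` line is indented inside the `callable` method rather than at the level of `render` itself. If that matches the actual file, `render` has no return statement of its own and Python returns `None` implicitly. The sketch below reproduces that effect without any Qt code; `Holder`, `render_broken` and `render_fixed` are hypothetical names used only for illustration, and this may not be the only cause in the real script.

```python
def render_broken(source):
    class Holder:
        def __init__(self, value):
            self.value = value

        def store(self, data):
            self.value = data
            # Misplaced: this return belongs to store(), not render_broken()
            return Holder(source).value
    # No return at function level, so the caller receives None


def render_fixed(source):
    class Holder:
        def __init__(self, value):
            self.value = value

        def store(self, data):
            self.value = data

    # Return dedented to the function body, after the class definition
    return Holder(source).value


print(render_broken("<html></html>"))
print(render_fixed("<html></html>"))
```

The first call prints `None` and the second prints the string that was passed in, which matches the symptom described above.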

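EDIT: Thanks for the reply with the code included. Now, however, it raises an error and the script hangs until I kill the Python launcher, which then causes a segfault.

Here is the revised code: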

def render(source_url):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    class Render(QWebEngineView):
        def __init__(self, url):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            #self.setHtml(html)
            self.load(QUrl(url))
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self._callable)

        def _callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()

    return Render(source_url).html

#url = 'http://webscraping.com'  
#url='http://www.amazon.com'
url="https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1"
print(render(url))

This throws the following errors:

$ python3 -tt fees-pkg-v2.py
Traceback (most recent call last):
  File "fees-pkg-v2.py", line 30, in _callable
    self.html = data
AttributeError: 'method' object has no attribute 'html'
None   (hangs here until force-quit python launcher)
Segmentation fault: 11
$

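I have started reading up on Python classes so that I fully understand what I am doing (always a good thing). I suspect the problem may be in my environment (OS X Yosemite, Python 3.4.3, Qt5.4.1, sip-4.16.6). Any other suggestions?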

Answer 1

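The problem was the environment. I had manually installed Python 3.4.3, Qt5.4.1 and sip-4.16.6, and must have messed something up. After installing Anaconda, the script started working. Thanks again.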
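
For anyone comparing a manual install against a packaged one, a quick way to see what the interpreter actually picks up is to print the version strings. This is only a minimal sketch: `environment_report` is a hypothetical helper, not part of PyQt5, and it assumes nothing beyond the `QT_VERSION_STR` and `PYQT_VERSION_STR` constants in `PyQt5.QtCore`. It reports an import failure instead of raising one.

```python
import platform


def environment_report():
    """Collect version strings relevant to a PyQt5 WebEngine setup."""
    report = {"python": platform.python_version()}
    try:
        from PyQt5.QtCore import QT_VERSION_STR, PYQT_VERSION_STR
        report["qt"] = QT_VERSION_STR
        report["pyqt"] = PYQT_VERSION_STR
    except ImportError as exc:
        # PyQt5 itself is not importable from this interpreter
        report["pyqt_error"] = str(exc)
        return report
    try:
        # Only checks that the module the script needs can be imported
        from PyQt5 import QtWebEngineWidgets  # noqa: F401
        report["webengine"] = "available"
    except ImportError as exc:
        report["webengine"] = "missing: {}".format(exc)
    return report


for name, value in sorted(environment_report().items()):
    print(name, value)
```

Running it under both the manually built Python and the Anaconda one should show whether the two environments differ in the Qt or PyQt versions they load.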

Scraping a JavaScript page using PyQt5 and QWebEngineView
