Question

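I'm trying to scrape data from a page on MLB.com that uses JavaScript to generate the HTML containing the data I want. I followed this tutorial, which uses PyQt4 to render the HTML, but the JavaScript is not actually being rendered; I just get back the same HTML I had before!

Here is my code: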

# Importing
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Importing stuff for parsing javascript
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

#basic function to get scrapy working
url2 = "removed due to size, link is above in post"

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

r = Render(url2)
result = r.frame.toHtml().encode('utf-8')

print(result)

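Note that I removed the URL from the code above, but it is the same URL as the first link in this post.

When I run this code, I get the same HTML as if I had simply fetched the page URL with urllib2 and printed the source. What can I do to make this code work the way I want?

Edit: here is the end of the code, taken from the rest of the tutorial, that raises an error: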

r = Render(url2)
result = r.frame.toHtml()
formattedResult = str(result.toAscii())

This code raises the following error:

AttributeError: 'str' object has no attribute 'toAscii'

When I don't call toAscii() and just call str(result), I get this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 21391-21393: ordinal not in range(128)

Answer 1

The summed-up code at the bottom of the tutorial is wrong. You still need to process the page using lxml:

from lxml import html

# QString should be converted to a string before being processed by lxml
formatted_result = str(result.toAscii())

# Next, build the lxml tree from formatted_result
tree = html.fromstring(formatted_result)

# Now, using the correct XPath, fetch the URLs of the archives
archive_links = tree.xpath('//div[@class="campaign"]/a/@href')

print(archive_links)

The author didn't include the rest of the code in the summary at the end.
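
To make the XPath step easier to follow, here is a minimal sketch that runs without PyQt4 or lxml. It swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed markup, so it is an illustration of the selection logic rather than a replacement for lxml on real pages. The sample markup and its URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for the rendered page.
sample = """
<html><body>
  <div class="campaign"><a href="/archive/1">First</a></div>
  <div class="other"><a href="/ignored">Skip</a></div>
  <div class="campaign"><a href="/archive/2">Second</a></div>
</body></html>
""".strip()

root = ET.fromstring(sample)

# ElementTree cannot select attributes with a trailing /@href,
# so select the <a> elements and read href from each one.
archive_links = [
    a.get("href")
    for a in root.findall(".//div[@class='campaign']/a")
]

print(archive_links)
```

Only the links inside div elements whose class is "campaign" are collected, which is what the XPath in the answer is meant to select.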

Answer 2

I found a solution. Instead of doing

formattedResult = str(result.toAscii())

try

formattedResult = str(result.encode('utf-8'))

It worked for me.
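
One caveat, offered as a sketch rather than a definitive fix: the AttributeError in the question suggests that toHtml() already returned a Python 3 str. If so, wrapping the encoded bytes in str() gives the printable form of the bytes object (starting with b') rather than the text itself. The sample string below is a hypothetical stand-in for the value returned by r.frame.toHtml().

```python
# Hypothetical stand-in for r.frame.toHtml(), assumed to be a Python 3 str.
result = '<div class="campaign"><a href="/archive/1">Café</a></div>'

encoded = result.encode("utf-8")   # bytes, suitable for writing to a file
wrapped = str(encoded)             # textual repr of the bytes, not the HTML
decoded = encoded.decode("utf-8")  # round-trips back to the original text

print(type(encoded).__name__)
print(wrapped[:2])
print(decoded == result)
```

Under that assumption, the str can be handed to the parser directly, and encode('utf-8') is only needed where bytes are required.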

Scraping data from mlb.com with Python and PyQt4 not rendering JavaScript
