Using Scrapy I ran into a problem with JavaScript-rendered pages. On a franchise forum site, for example the page http://www.idee-franchise.com/forum/viewtopic.php?f=3&t=69, when I try to scrape the source HTML I can't retrieve any posts, because they seem to be "appended" after the page is rendered (probably via JavaScript).
So I looked online for a solution to this problem and came across https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/.
I am completely new to PyQt, but I wanted to take a shortcut and copy-pasted some of the code.
This worked perfectly when I scraped a single page. But when I implemented it inside Scrapy, I got the following errors:
QObject::connect: Cannot connect (null)::configurationAdded(QNetworkConfiguration) to QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationRemoved(QNetworkConfiguration) to QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationChanged(QNetworkConfiguration) to QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to QNetworkConfigurationManager::onlineStateChanged(bool)
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to QNetworkConfigurationManager::updateCompleted()
If I scrape a single page the errors don't occur, but when I set the crawler to recursive mode, then at the second link I get a "python.exe has stopped working" crash along with the errors above.
I went searching for what it might be, and somewhere I read that the QApplication object should only be started once.
Can someone tell me what the correct implementation would be?
The spider
# -*- coding: utf-8 -*-
import scrapy
import sys, traceback
from bs4 import BeautifulSoup as bs
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawler.items import ThreadItem, PostItem
from crawler.utils import utils


class IdeefranchiseSpider(CrawlSpider):
    name = "ideefranchise"
    allowed_domains = ["idee-franchise.com"]
    start_urls = (
        'http://www.idee-franchise.com/forum/',
        # 'http://www.idee-franchise.com/forum/viewtopic.php?f=3&t=69',
    )
    rules = [
        Rule(LinkExtractor(allow='/forum/'), callback='parse_thread', follow=True)
    ]

    def parse_thread(self, response):
        print "Parsing Thread", response.url
        thread = ThreadItem()
        thread['url'] = response.url
        thread['domain'] = self.allowed_domains[0]
        # get_thread_title / get_thread_forum_name are defined elsewhere
        thread['title'] = self.get_thread_title(response)
        thread['forumname'] = self.get_thread_forum_name(response)
        thread['posts'] = self.get_thread_posts(response)
        yield thread
        # paginate if possible
        next_page = response.css('fieldset.display-options > a::attr("href")')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_thread)

    def get_thread_posts(self, response):
        # using PYQTPageRenderor to reload the page. I think this is where
        # the problem occurs, when I instantiate the PYQTPageRenderor object.
        soup = bs(unicode(utils.PYQTPageRenderor(response.url).get_html()))
        # sleep so that PyQt can render the page
        # time.sleep(5)
        # comments
        posts = []
        for item in soup.select("div.post.bg2") + soup.select("div.post.bg1"):
            try:
                post = PostItem()
                post['profile'] = item.select("p.author > strong > a")[0].get_text()
                details = item.select('dl.postprofile > dd')
                post['date'] = details[2].get_text()
                post['content'] = item.select('div.content')[0].get_text()
                # appending the comment
                posts.append(post)
            except Exception:
                e = sys.exc_info()[0]
                self.logger.critical("ERROR GET_THREAD_POSTS %s", e)
                traceback.print_exc(file=sys.stdout)
        return posts
The PyQt implementation
import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage


class Render(QWebPage):
    def __init__(self, url):
        # NOTE: a new QApplication is created on every call, even though
        # Qt requires exactly one per process.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()


class PYQTPageRenderor(object):
    def __init__(self, url):
        self.url = url

    def get_html(self):
        r = Render(self.url)
        return unicode(r.frame.toHtml())
Answer 0 (score: 0)
The correct implementation, if you want to do it yourself, is to create a downloader middleware that uses PyQt to handle requests. It will be instantiated once by Scrapy.
It shouldn't be that complicated, just:
- Create a QTDownloader class in the project's middleware.py file.
- The constructor should create the QApplication object.
- The process_request method should do the URL loading and the HTML fetching. Note that you return a Response object with the HTML string.
- You could do appropriate cleanup in a _cleanup method of your class.
- Finally, activate the middleware by adding it to the DOWNLOADER_MIDDLEWARES variable in your project's settings.py file, as sketched below.
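A minimal sketch of what that middleware could look like, assuming PyQt4 as in the question. The QEventLoop-based load wait and the HtmlResponse construction are my choices, not code from the answer:

# middleware.py
import sys
from PyQt4.QtCore import QUrl, QEventLoop
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage
from scrapy.http import HtmlResponse


class QTDownloader(object):
    def __init__(self):
        # Scrapy instantiates the middleware once, so the QApplication
        # is created exactly once per process (or reused if it exists).
        self.app = QApplication.instance() or QApplication(sys.argv)

    def process_request(self, request, spider):
        page = QWebPage()
        loop = QEventLoop()
        page.loadFinished.connect(loop.quit)
        page.mainFrame().load(QUrl(request.url))
        loop.exec_()  # block until loadFinished fires
        html = unicode(page.mainFrame().toHtml())
        # Returning a Response makes Scrapy skip its own downloader.
        return HtmlResponse(request.url, body=html.encode('utf-8'),
                            encoding='utf-8', request=request)

    def _cleanup(self):
        self.app.quit()

And the activation in settings.py (the module path assumes the question's project layout, crawler/):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'crawler.middleware.QTDownloader': 543,
}

The essential difference from the question's Render class is that only a short-lived QEventLoop is spun per page, instead of starting and quitting a fresh QApplication for every URL.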
If you don't want to write your own solution, you could use an existing middleware that uses Selenium to do the downloading, for example scrapy-webdriver. And if you don't want to have a visible browser, you can instruct it to use PhantomJS.
EDIT1:
So the correct way to do this, as pointed out by Rejected, is to use a download handler. The idea is similar, but the downloading should happen in a download_request method, and the handler should be enabled by adding it to DOWNLOAD_HANDLERS. Take a look at the WebdriverDownloadHandler for an example.
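For illustration, a rough sketch of a handler along those lines; the class name, module path, and rendering code are my assumptions (the real WebdriverDownloadHandler is more involved):

# handlers.py (hypothetical location)
import sys
from PyQt4.QtCore import QUrl, QEventLoop
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage
from scrapy.http import HtmlResponse


class QTDownloadHandler(object):
    def __init__(self, settings):
        self.app = QApplication.instance() or QApplication(sys.argv)

    def download_request(self, request, spider):
        # Same rendering trick as in the middleware sketch above.
        page = QWebPage()
        loop = QEventLoop()
        page.loadFinished.connect(loop.quit)
        page.mainFrame().load(QUrl(request.url))
        loop.exec_()
        html = unicode(page.mainFrame().toHtml())
        # Scrapy wraps this call in a Deferred, so returning a plain
        # Response works in the Scrapy versions I have seen.
        return HtmlResponse(request.url, body=html.encode('utf-8'),
                            encoding='utf-8', request=request)

Enabled via:

# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'crawler.handlers.QTDownloadHandler',
}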