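I am a beginner with Scrapy. I want to collect the item links from an index page and then get information from each item page. Because I need to handle JavaScript on the index page, I use Selenium WebDriver together with Scrapy. Here is my code in progress.py.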
from scrapy.spider import Spider
from scrapy.http import Request
from selenium import selenium
from selenium import webdriver
from mustdo.items import MustdoItem
import time

class ProgressSpider(Spider):
    name = 'progress' # spider's name
    allowed_domains = ['example.com'] # crawling domain
    start_urls = ['http://www.example.com']

    def __init__(self):
        Spider.__init__(self)
        self.log('----------in __init__----------')
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.log('----------in parse----------')
        self.driver.get(response.url)

        # Here're some operations of self.driver with javascript.

        elements = []
        elements = self.driver.find_elements_by_xpath('//table/tbody/tr/td/a[1]')

        #get the number of the item
        self.log('----------Link number is----------'+str(len(elements)))

        for element in elements:
            #get the url of the item
            href = element.get_attribute('href')
            print href
            self.log('----------next href is ----------'+href)
            yield Request(href,callback=self.parse_item)

        self.driver.close()

    def parse_item(self, response):
        self.log('----------in parse_item----------')
        self.driver.get(response.url)

        #build item
        item = MustdoItem()
        item['title'] = self.driver.find_element_by_xpath('//h2').text
        self.log('----------item created----------'+self.driver.find_element_by_xpath('//h2').text)
        time.sleep(10)
        return item
I also have items.py, which defines the MustdoItem used here. Here is the code.
from scrapy.item import Item, Field

class MustdoItem(Item):
    title = Field()
When I run the spider, I can get several items (maybe 6 or 7 out of 20). But after a while, I get the following error message.
Traceback (most recent call last):
  File "F:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "F:\Python27\lib\site-packages\twisted\internet\task.py", line 638, in _tick
    taskObj._oneWorkUnit()
  File "F:\Python27\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
    result = next(self._iterator)
  File "F:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\utils\defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "F:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\utils\defer.py", line 96, in iter_errback
    yield next(it)
  File "F:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\contrib\spidermiddleware\offsite.py", line 23, in process_spider_output
    for x in result:
  File "F:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\contrib\spidermiddleware\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "F:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\contrib\spidermiddleware\urllength.py", line 33, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "F:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\contrib\spidermiddleware\depth.py", line 50, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "mustdo\spiders\progress.py", line 32, in parse
    print element.tag_name
  File "F:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.py", line 50, in tag_name
    return self._execute(Command.GET_ELEMENT_TAG_NAME)['value']
  File "F:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.py", line 369, in _execute
    return self._parent.execute(command, params)
  File "F:\Python27\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 164, in execute
    self.error_handler.check_response(response)
  File "F:\Python27\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 164, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: u'Element not found in the cache - perhaps the page has changed since it was looked up' ; Stacktrace:
    at fxdriver.cache.getElementAt (resource://fxdriver/modules/web_element_cache.js:7610)
    at Utils.getElementAt (file:///c:/users/marian/appdata/local/temp/tmpmgnqid/extensions/fxdriver@googlecode.com/components/command_processor.js:7210)
    at WebElement.getElementTagName (file:///c:/users/marian/appdata/local/temp/tmpmgnqid/extensions/fxdriver@googlecode.com/components/command_processor.js:10353)
    at DelayedCommand.prototype.executeInternal_/h (file:///c:/users/marian/appdata/local/temp/tmpmgnqid/extensions/fxdriver@googlecode.com/components/command_processor.js:10878)
    at DelayedCommand.prototype.executeInternal_ (file:///c:/users/marian/appdata/local/temp/tmpmgnqid/extensions/fxdriver@googlecode.com/components/command_processor.js:10883)
    at DelayedCommand.prototype.execute/< (file:///c:/users/marian/appdata/local/temp/tmpmgnqid/extensions/fxdriver@googlecode.com/components/command_processor.js:10825)
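I tested my code and found that if I remove "yield Request(href,callback=self.parse_item)" from the parse function, I can get the links of all the items. While "progress.py" is running, I observed that the error message appears right after "----------in parse_item----------" is printed by self.log for the first time. From this I deduce that the order in which the requests are yielded and the callbacks run causes the error: parse and parse_item share one driver, so parse_item loads a new page while parse is still iterating over elements from the index page. But I don't know how to deal with this problem.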
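The suspected interleaving can be reproduced in a toy model without Scrapy or Selenium. This is only a sketch under assumptions: `FakeDriver`, `FakeElement`, `StaleElement`, `lazy_parse`, `eager_parse` and `crawl` are hypothetical stand-ins and not part of either library, and the model assumes that the shared driver navigates away between two steps of the `parse` generator. The real scheduling in Scrapy is less regular, which may be why a few items come through before the error.

```python
class StaleElement(Exception):
    """Stand-in for selenium's StaleElementReferenceException."""


class FakeElement(object):
    def __init__(self, driver, href):
        self._driver = driver
        self._href = href
        self._page = driver.page_loads  # page load this element belongs to

    def get_attribute(self, name):
        # An element is only usable while its own page is still loaded.
        if self._page != self._driver.page_loads:
            raise StaleElement('page has changed since it was looked up')
        return self._href


class FakeDriver(object):
    def __init__(self):
        self.page_loads = 0

    def get(self, url):
        self.page_loads += 1  # navigating invalidates earlier elements

    def find_links(self):
        return [FakeElement(self, 'http://www.example.com/item/%d' % i)
                for i in range(3)]


def lazy_parse(driver):
    """Like parse() above: each href is read only when the next value is pulled."""
    driver.get('http://www.example.com')
    for element in driver.find_links():
        yield element.get_attribute('href')


def eager_parse(driver):
    """Variant: copy every href into a plain string before yielding anything."""
    driver.get('http://www.example.com')
    hrefs = [element.get_attribute('href') for element in driver.find_links()]
    for href in hrefs:
        yield href


def crawl(parse):
    """Visit each yielded URL with the same driver, as parse_item does."""
    driver = FakeDriver()
    visited = []
    for href in parse(driver):
        driver.get(href)  # the shared driver leaves the index page here
        visited.append(href)
    return visited
```

In this model `crawl(lazy_parse)` raises `StaleElement` when it asks for the second link, while `crawl(eager_parse)` returns all three. That points to two possible directions, neither verified against the real site: read all `href` values into a list before the first `yield`, or give `parse_item` a driver of its own (and only close the shared one once no callback still needs it).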
Any insight is much appreciated!
Best regards! :)