Scrapy: ERROR: Error processing

Time: 2014-05-21 20:43:42

Tags: python scrapy scrapy-spider

I wrote a sample scraper (actually, I modified the one from the tutorial):

from scrapy.spider import Spider
from scrapy.selector import Selector
from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["cryptocoincharts.info"]
    start_urls = [
        "http://www.cryptocoincharts.info/v2/coins/show/drk",
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//table[@class="table table-striped"]//tr[7]/td[2]')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('text()').re('[^\t\n]+')
            items.append(item)
        return items

I get an "Error processing" error. Here is the log:

scrapy crawl dmoz -o items.json -t json

2014-05-21 22:26:54+0200 [scrapy] INFO: Scrapy 0.23.0-231-g2bf09b8 started (bot: scrapybot)
2014-05-21 22:26:54+0200 [scrapy] INFO: Optional features available: ssl, http11
2014-05-21 22:26:54+0200 [scrapy] INFO: Overridden settings: {'DEFAULT_ITEM_CLASS': 'dirbot.items.Website', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['dirbot.spiders'], 'FEED_URI': 'items.json', 'NEWSPIDER_MODULE': 'dirbot.spiders'}
2014-05-21 22:26:54+0200 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-21 22:26:54+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-21 22:26:54+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-21 22:26:54+0200 [scrapy] INFO: Enabled item pipelines: FilterWordsPipeline
2014-05-21 22:26:54+0200 [dmoz] INFO: Spider opened
2014-05-21 22:26:54+0200 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-05-21 22:26:54+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-21 22:26:54+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-21 22:26:54+0200 [dmoz] DEBUG: Crawled (200) <GET http://www.cryptocoincharts.info/v2/coins/show/drk> (referer: None)
2014-05-21 22:26:54+0200 [dmoz] ERROR: Error processing {'name': [u'0.0160990 BTC',
              u'7.9770495 USD',
              u'5.7816480 EUR',
              u'48.829847 CNY',
              u'4.7026302 GBP',
              u'6.9809075 CHF',
              u'8.6828030 CAD',
              u'811.85225 JPY',
              u'8.5037582 AUD',
              u'83.350117 ZAR',
              u'0.00595524\xa0oz GOLD (= 0.17\xa0grams)',
              u'0.37805922\xa0oz SILVER (= 10.72\xa0grams)']}
    Traceback (most recent call last):
      File "/usr/lib/pymodules/python2.7/scrapy/middleware.py", line 62, in _process_chain
        return process_chain(self.methods[methodname], obj, *args)
      File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 65, in process_chain
        d.callback(input)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
        self._startRunCallbacks(result)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/me/Desktop/scrapy/dirbot-master/dirbot-master/dirbot/pipelines.py", line 13, in process_item
        if word in unicode(item['description']).lower():
      File "/usr/lib/pymodules/python2.7/scrapy/item.py", line 50, in __getitem__
        return self._values[key]
    exceptions.KeyError: 'description'

2014-05-21 22:26:54+0200 [dmoz] ERROR: Error processing {'name': []}
    Traceback (most recent call last):
      File "/usr/lib/pymodules/python2.7/scrapy/middleware.py", line 62, in _process_chain
        return process_chain(self.methods[methodname], obj, *args)
      File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 65, in process_chain
        d.callback(input)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
        self._startRunCallbacks(result)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/me/Desktop/scrapy/dirbot-master/dirbot-master/dirbot/pipelines.py", line 13, in process_item
        if word in unicode(item['description']).lower():
      File "/usr/lib/pymodules/python2.7/scrapy/item.py", line 50, in __getitem__
        return self._values[key]
    exceptions.KeyError: 'description'

2014-05-21 22:26:54+0200 [dmoz] INFO: Closing spider (finished)
2014-05-21 22:26:54+0200 [dmoz] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 254,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 4986,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 5, 21, 20, 26, 54, 390997),
     'log_count/DEBUG': 3,
     'log_count/ERROR': 2,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2014, 5, 21, 20, 26, 54, 211942)}
2014-05-21 22:26:54+0200 [dmoz] INFO: Spider closed (finished)

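I tried to figure out what is going on, but unfortunately I cannot find any reason why the items are not exported to the JSON file. In an earlier project, Scrapy exported multi-line data to JSON without any problems.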

1 answer:

Answer 0 (score: 2)

Look closely at the traceback. It contains these lines:

File "/home/me/Desktop/scrapy/dirbot-master/dirbot-master/dirbot/pipelines.py", line 13, in process_item
    if word in unicode(item['description']).lower():

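This means that your pipeline raises an error (the KeyError: 'description' at the end of the traceback) when it tries to process an item.

Then look at which fields you fill in the spider: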

for site in sites:
    item = Website()
    item['name'] = site.xpath('text()').re('[^\t\n]+')
    items.append(item)

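As you can see, the description field is never set. That is the cause of the error. Because the pipeline fails on every item, no item reaches the feed exporter, so nothing is written to items.json.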
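One way to fix this is to set item['description'] in the spider. Another is to make the pipeline tolerate a missing field, which the sketch below illustrates. It is only a rough sketch under several assumptions: the word list is made up, DropItem is a local stand-in for scrapy.exceptions.DropItem so the example runs without Scrapy installed, and plain dicts stand in for Website items (Scrapy items offer the same dict-style get()). It uses str() instead of the unicode() call from the traceback so that it also runs on Python 3.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption, so this runs without Scrapy)."""


class FilterWordsPipeline(object):
    # Hypothetical word list; the real one lives in dirbot/pipelines.py.
    words_to_filter = ["politics", "religion"]

    def process_item(self, item, spider):
        # get() returns the default instead of raising KeyError
        # when the spider never set 'description'.
        description = str(item.get("description", "")).lower()
        for word in self.words_to_filter:
            if word in description:
                raise DropItem("Contains forbidden word: %s" % word)
        return item


pipeline = FilterWordsPipeline()

# A plain dict stands in for a Website item that only has 'name' set.
item = {"name": ["0.0160990 BTC"]}
result = pipeline.process_item(item, spider=None)
```

With this change an item without a description passes through the pipeline unchanged, while items whose description contains a filtered word are still dropped.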