Exception error when crawling with Scrapy

Asked: 2012-12-18 02:25:31

Tags: python scrapy

I started experimenting with Scrapy to crawl a website, but when I test my code I get an error that I can't figure out how to fix.

Here is the error output:

...
2012-12-18 02:07:19+0000 [dmoz] DEBUG: Crawled (200) <GET http://MYURL.COM> (referer: None)
2012-12-18 02:07:19+0000 [dmoz] ERROR: Spider error processing <GET http://MYURL.COM>
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 1178, in mainLoop
        self.runUntilCurrent()
      File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 368, in callback
        self._startRunCallbacks(result)
      File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 464, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 551, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.16.3-py2.7.egg/scrapy/spider.py", line 57, in parse
        raise NotImplementedError
    exceptions.NotImplementedError: 

2012-12-18 02:07:19+0000 [dmoz] INFO: Closing spider (finished)
2012-12-18 02:07:19+0000 [dmoz] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 357,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 20704,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 12, 18, 2, 7, 19, 595977),
     'log_count/DEBUG': 7,
     'log_count/ERROR': 1,
     'log_count/INFO': 4,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'spider_exceptions/NotImplementedError': 1,
     'start_time': datetime.datetime(2012, 12, 18, 2, 7, 18, 836322)}

It looks like this may be related to my parse function and its callback. I tried removing the rule, but then the spider only works on a single URL, and what I need is to crawl the entire site.

Here is my code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item
from tutorial.items import DmozItem


class DmozSpider(BaseSpider):
    name = "dmoz"
    start_urls = ["http://MYURL.COM"]
    rules = (Rule(SgmlLinkExtractor(allow_domains=('http://MYURL.COM', )), callback='parse_l', follow=True),)


    def parse_l(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//div[@class='content']")
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select("//div[@class='gig-title-g']/h1").extract()
            item['link'] = site.select("//ul[@class='gig-stats prime']/li[@class='queue ']/div[@class='big-txt']").extract()
            item['desc'] = site.select("//li[@class='thumbs'][1]/div[@class='gig-stats-numbers']/span").extract()
            items.append(item)
        return items

Any hints pointing me in the right direction would be appreciated.

Thanks a lot!

1 Answer:

Answer (score: 3):

Found the answer to this question here:

Why does scrapy throw an error for me when trying to spider and parse a site?

It turns out that BaseSpider does not implement Rule. The rules attribute is silently ignored, every response is routed to the spider's parse method, and BaseSpider's default parse simply raises NotImplementedError.
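The mechanism behind the traceback can be shown in plain Python, without Scrapy: the framework only ever calls a method named `parse`, and if a subclass defines a differently named callback (like `parse_l`) without anything dispatching to it, the base class's `parse` fires instead. A minimal sketch with hypothetical class names (not Scrapy's actual source):

```python
class SpiderBase:
    """Stand-in for Scrapy's BaseSpider: parse() must be overridden."""
    def parse(self, response):
        raise NotImplementedError


class BrokenSpider(SpiderBase):
    # Defines parse_l, but nothing ever calls it: without rule
    # support, the framework still invokes parse(), which falls
    # through to the base class and raises.
    def parse_l(self, response):
        return [{"title": "example"}]


spider = BrokenSpider()
try:
    spider.parse("<html></html>")  # what the framework actually calls
    raised = False
except NotImplementedError:
    raised = True

print(raised)                          # True: default parse() blew up
print(spider.parse_l("<html></html>"))  # the callback works, if anything calls it
```

This is exactly the shape of the error in the log: `parse_l` is fine, but it is never reached.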

If you have stumbled onto this question and are crawling with BaseSpider, you need to change it to CrawlSpider and import it as described at http://doc.scrapy.org/en/latest/topics/spiders.html:

from scrapy.contrib.spiders import CrawlSpider, Rule