I started testing Scrapy to crawl a website, but when I run my code I get an error that I can't figure out how to fix.
Here is the error output:
...
2012-12-18 02:07:19+0000 [dmoz] DEBUG: Crawled (200) <GET http://MYURL.COM> (referer: None)
2012-12-18 02:07:19+0000 [dmoz] ERROR: Spider error processing <GET http://MYURL.COM>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 1178, in mainLoop
self.runUntilCurrent()
File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 800, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 368, in callback
self._startRunCallbacks(result)
File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 464, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 551, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.16.3-py2.7.egg/scrapy/spider.py", line 57, in parse
raise NotImplementedError
exceptions.NotImplementedError:
2012-12-18 02:07:19+0000 [dmoz] INFO: Closing spider (finished)
2012-12-18 02:07:19+0000 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 357,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 20704,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 12, 18, 2, 7, 19, 595977),
'log_count/DEBUG': 7,
'log_count/ERROR': 1,
'log_count/INFO': 4,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/NotImplementedError': 1,
'start_time': datetime.datetime(2012, 12, 18, 2, 7, 18, 836322)}
It looks like this may be related to my parse
function and the callback. I tried removing the rule
, but then the spider only handles one single URL, and what I need is to crawl the whole site.
Here is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    start_urls = ["http://MYURL.COM"]
    rules = (Rule(SgmlLinkExtractor(allow_domains=('http://MYURL.COM', )),
                  callback='parse_l', follow=True),)

    def parse_l(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class=\'content\']')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('//div[@class=\'gig-title-g\']/h1').extract()
            item['link'] = site.select('//ul[@class=\'gig-stats prime\']/li[@class=\'queue \']/div[@class=\'big-txt\']').extract()
            item['desc'] = site.select('//li[@class=\'thumbs\'][1]/div[@class=\'gig-stats-numbers\']/span').extract()
            items.append(item)
        return items
Any pointers in the right direction would be appreciated.
Thanks a lot!
Answer 0 (score: 3)
Found the answer to this question here:
Why does scrapy throw an error for me when trying to spider and parse a site?
It turns out that BaseSpider
does not implement Rule
. If you stumbled onto this question and are crawling with BaseSpider
, you need to change it to CrawlSpider
and import it as described at http://doc.scrapy.org/en/latest/topics/spiders.html:
from scrapy.contrib.spiders import CrawlSpider, Rule
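The mechanism behind the traceback can be sketched in plain Python, without Scrapy installed. This is a mock, not the real library: the class names mirror Scrapy 0.16's, but the bodies are reduced to the behavior that matters here.

```python
class BaseSpider(object):
    """Mimics scrapy.spider.BaseSpider: the default callback is
    abstract, so subclasses must override parse()."""
    def parse(self, response):
        raise NotImplementedError

class DmozSpider(BaseSpider):
    # On BaseSpider, `rules` is just an ignored class attribute,
    # so parse_l is never registered as a callback.
    rules = ()

    def parse_l(self, response):
        return []

spider = DmozSpider()
try:
    # Scrapy routes every response from start_urls to parse() by
    # default, and BaseSpider leaves parse() unimplemented.
    spider.parse("<html>...</html>")
except NotImplementedError:
    print("spider error: NotImplementedError")  # prints this line
```

Subclassing CrawlSpider instead changes the second half of this picture: CrawlSpider provides its own parse() that applies the rules and dispatches matched links to the named callback (parse_l above), which is why the fix is to switch the base class rather than to rename the callback to parse.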