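I am using Scrapy 0.20 with Python 2.7.
I want to know whether the link I am currently crawling has already been scraped.
I searched a lot and found this example: "How to filter duplicate requests based on url in scrapy".
I copied the code, put it in my spiders folder, and changed the settings, but I got this exception: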
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1237, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1099, in _inlineCallbacks
    result = g.send(result)
  File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\crawler.py", line 66, in start
    yield self.engine.open_spider(self._spider, self._start_requests())
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1237, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1099, in _inlineCallbacks
    result = g.send(result)
  File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\core\engine.py", line 221, in open_spider
    scheduler = self.scheduler_cls.from_crawler(self.crawler)
  File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\core\scheduler.py", line 25, in from_crawler
    dupefilter = dupefilter_cls.from_settings(settings)
exceptions.AttributeError: 'module' object has no attribute 'from_settings'
My code:
import os

from scrapy.dupefilter import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class CustomFilter(RFPDupeFilter):
    # Treats two URLs as duplicates when they match up to "&refer".

    def __getid(self, url):
        # Keep only the part of the URL before "&refer".
        mm = url.split("&refer")[0]  # or something like that
        return mm

    def request_seen(self, request):
        # Debug marker to check whether this filter is being called at all.
        print "SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS"
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
In the settings I added this:
DUPEFILTER_CLASS = 'myproject.spiders.CustomFilter'
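For what it's worth, the error says the object that got loaded is a module, not a class. The sketch below is only an illustration of how that can happen. It assumes the setting is resolved by importing everything before the last dot and then fetching the final name, and that the filter lives in a file named CustomFilter.py inside the spiders folder. The `demo_spiders` package and the local `load_object` helper are hypothetical stand-ins, not Scrapy code.

```python
import sys
import types
from importlib import import_module


def load_object(path):
    # Hypothetical resolver: import the part before the last dot,
    # then look up the last name as an attribute of that module.
    module_path, name = path.rsplit(".", 1)
    module = import_module(module_path)
    return getattr(module, name)


class CustomFilter(object):
    # Stand-in for the real filter class; only from_settings matters here.
    @classmethod
    def from_settings(cls, settings):
        return cls()


# Fake package layout (assumption): demo_spiders/CustomFilter.py
pkg = types.ModuleType("demo_spiders")
pkg.__path__ = []  # marks the fake module as a package
mod = types.ModuleType("demo_spiders.CustomFilter")
mod.CustomFilter = CustomFilter
pkg.CustomFilter = mod
sys.modules["demo_spiders"] = pkg
sys.modules["demo_spiders.CustomFilter"] = mod

# A path that stops at the file name resolves to the module object.
as_module = load_object("demo_spiders.CustomFilter")
# A path that also names the class resolves to the class itself.
as_class = load_object("demo_spiders.CustomFilter.CustomFilter")

print(type(as_module).__name__)
print(hasattr(as_module, "from_settings"))
print(hasattr(as_class, "from_settings"))
```

If that assumption about the file name holds, the setting would need the class name appended after the module name, along the lines of 'myproject.spiders.CustomFilter.CustomFilter'.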