Scrapy and Twisted errors

Asked: 2015-08-24 17:01:27

Tags: python scrapy twisted

I inherited a project, and while trying to fix one issue I had to upgrade all of the project's packages. In doing so I ran into even more problems, and now I'm at my wits' end.

It's a web-scraping project that uses a lot of packages. I updated Scrapy and Twisted to the latest versions, and now I get the error below when I run my scraper from the command line. I've tried downgrading Twisted and uninstalling/reinstalling it, but I still get the same error.

I'm running Windows 8.1.

Here is the error:

    c:\RND\scraper\crawlers>scrapy crawl reuters
    2015-08-24 12:40:34 [scrapy] INFO: Scrapy 1.0.3 started (bot: crawlers)
    2015-08-24 12:40:34 [scrapy] INFO: Optional features available: ssl, http11
    2015-08-24 12:40:34 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'crawlers.spiders', 'DUPEFILTER_CLASS': 'crawlers.utils.DuplicateArticleFilter', 'SPIDER_MODULES': ['crawlers.spiders.reuters', 'crawlers.spiders.bbc', 'crawlers.spiders.canwildlife', 'crawlers.spiders.usgs'], 'BOT_NAME': 'crawlers', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36', 'DOWNLOAD_DELAY': 1}
    2015-08-24 12:40:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
    c:\Python27\lib\site-packages\twisted\internet\endpoints.py:29: DeprecationWarning: twisted.internet.interfaces.IStreamClientEndpointStringParser was deprecated in Twisted 14.0.0: This interface has been superseded by IStreamClientEndpointStringParserWithReactor.
      from twisted.internet.interfaces import (

    2015-08-24 12:40:35 [py.warnings] WARNING: c:\Python27\lib\site-packages\twisted\internet\endpoints.py:29: DeprecationWarning: twisted.internet.interfaces.IStreamClientEndpointStringParser was deprecated in Twisted 14.0.0: This interface has been superseded by IStreamClientEndpointStringParserWithReactor.
      from twisted.internet.interfaces import (

    2015-08-24 12:40:36 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-08-24 12:40:36 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-08-24 12:40:36 [scrapy] INFO: Enabled item pipelines: MongodbExportPipeline
    2015-08-24 12:40:36 [scrapy] INFO: Spider opened
    Unhandled error in Deferred:
    2015-08-24 12:40:36 [twisted] CRITICAL: Unhandled error in Deferred:

    Traceback (most recent call last):
      File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 1274, in unwindGenerator
        return _inlineCallbacks(None, gen, Deferred())
      File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 1128, in _inlineCallbacks
        result = g.send(result)
      File "c:\Python27\lib\site-packages\scrapy\crawler.py", line 73, in crawl
        yield self.engine.open_spider(self.spider, start_requests)
      File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 1274, in unwindGenerator
        return _inlineCallbacks(None, gen, Deferred())
    --- <exception caught here> ---
      File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 1128, in _inlineCallbacks
        result = g.send(result)
      File "c:\Python27\lib\site-packages\scrapy\core\engine.py", line 232, in open_spider
        scheduler = self.scheduler_cls.from_crawler(self.crawler)
      File "c:\Python27\lib\site-packages\scrapy\core\scheduler.py", line 28, in from_crawler
        dupefilter = dupefilter_cls.from_settings(settings)
      File "c:\Python27\lib\site-packages\scrapy\dupefilters.py", line 44, in from_settings
        return cls(job_dir(settings), debug)
    exceptions.TypeError: __init__() takes at most 2 arguments (3 given)
    2015-08-24 12:40:36 [twisted] CRITICAL:

Here is my pip list:

    amqp (1.4.6)
    anyjson (0.3.3)
    billiard (3.3.0.16)
    celery (3.1.9)
    cffi (1.1.2)
    characteristic (14.3.0)
    cryptography (0.9.3)
    cssselect (0.9.1)
    cython (0.20.1)
    django (1.6.1)
    django-extensions (1.3.
    django-guardian (1.1.1)
    django-userena (1.2.4)
    dstk (0.50)
    easy-thumbnails (1.4)
    egenix (0.13.0-1.0.0j-1
    enum34 (1.0.4)
    geomet (0.1.0)
    html2text (3.200.3)
    idna (2.0)
    ipaddress (1.0.12)
    ipython (1.1.0)
    kombu (3.0.24)
    lxml (3.4.4)
    mongoengine (0.8.7)
    ndg-httpsclient (0.4.0)
    pillow (2.3.0)
    pip (7.1.0)
    psycopg2 (2.5.2)
    pyasn1 (0.1.8)
    pyasn1-modules (0.0.5)
    pycparser (2.14)
    pymongo (2.6.3)
    pyOpenSSL (0.15.1)
    pyreadline (2.0)
    python-dateutil (2.2)
    pytz (2014.1)
    queuelib (1.2.2)
    requests (2.7.0)
    Scrapy (1.0.3)
    service-identity (14.0.
    setuptools (18.2)
    simplejson (3.3.3)
    six (1.9.0)
    south (0.8.4)
    Twisted (15.3.0)
    version (0.1.1)
    w3lib (1.11.0)
    zope.interface (4.1.2)

1 Answer:

Answer 0 (score: 1):

Your spider uses a custom dupefilter, configured in its settings.py file ('DUPEFILTER_CLASS': 'crawlers.utils.DuplicateArticleFilter').

Scrapy is throwing the exception while trying to instantiate that dupefilter. Try running the spider without the dupefilter, as shown below, and see whether it loads.
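
For example, commenting the setting out of settings.py makes Scrapy fall back to its built-in RFPDupeFilter, which is the DUPEFILTER_CLASS default:

    # settings.py -- temporarily disable the custom dupefilter while
    # debugging; Scrapy then falls back to the default RFPDupeFilter.
    # DUPEFILTER_CLASS = 'crawlers.utils.DuplicateArticleFilter'

If the spider starts cleanly after this, the problem is confirmed to be in DuplicateArticleFilter itself.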

Note: your spider will not filter duplicate URLs correctly until the dupefilter is updated for the latest Scrapy/Twisted. However, without knowing which versions of Scrapy/Twisted you upgraded from, and without seeing the code for the settings/dupefilter, we can't be sure why the exception is thrown.
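
That said, your traceback does point at the likely cause. In Scrapy 1.0, from_settings (in scrapy/dupefilters.py, visible in the traceback) instantiates the class as cls(job_dir(settings), debug), i.e. with two arguments, so a custom dupefilter whose __init__ still accepts only a path (or nothing at all) fails with exactly this TypeError. Below is a minimal sketch of a compatible constructor, assuming DuplicateArticleFilter subclasses RFPDupeFilter; the real class may look different:

    # crawlers/utils.py -- a sketch only: the actual DuplicateArticleFilter
    # logic is unknown, so only the constructor compatibility is shown.
    from scrapy.dupefilters import RFPDupeFilter

    class DuplicateArticleFilter(RFPDupeFilter):
        # Scrapy 1.0's from_settings() calls cls(job_dir(settings), debug),
        # so __init__ must accept both the job-dir path and the debug flag.
        def __init__(self, path=None, debug=False):
            super(DuplicateArticleFilter, self).__init__(path, debug)
            # ... any article-specific duplicate-tracking state goes here ...

Keeping the signature in step with RFPDupeFilter.__init__(self, path=None, debug=False) should let from_settings construct the filter again on Scrapy 1.0.3.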