Port error in Scrapy

Date: 2013-07-04 15:21:36

Tags: twisted scrapy web-crawler

I have designed a crawler that contains two spiders, both written with Scrapy. The spiders run independently, fetching their start data from a database.

We run these spiders with the Twisted reactor, and we know the reactor cannot be restarted once it has stopped. We feed the second spider more than 500 links to crawl. When we do, we run into a port error, i.e. Scrapy keeps trying to bind a single port:

Error caught on signal handler: <bound method ?.start_listening of <scrapy.telnet.TelnetConsole instance at 0x0467B440>>
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1070, in _inlineCallbacks
    result = g.send(result)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\core\engine.py", line 75, in start
    yield self.signals.send_catch_log_deferred(signal=signals.engine_started)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\signalmanager.py", line 23, in send_catch_log_deferred
    return signal.send_catch_log_deferred(*a, **kw)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\utils\signal.py", line 53, in send_catch_log_deferred
    *arguments, **named)
  --- <exception caught here> ---
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 137, in maybeDeferred
    result = f(*args, **kw)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\xlib\pydispatch\robustapply.py", line 47, in robustApply
    return receiver(*arguments, **named)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\telnet.py", line 47, in start_listening
    self.port = listen_tcp(self.portrange, self.host, self)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\utils\reactor.py", line 14, in listen_tcp
    return reactor.listenTCP(x, factory, interface=host)
  File "C:\Python27\lib\site-packages\twisted\internet\posixbase.py", line 489, in listenTCP
    p.startListening()
  File "C:\Python27\lib\site-packages\twisted\internet\tcp.py", line 980, in startListening
    raise CannotListenError(self.interface, self.port, le)
twisted.internet.error.CannotListenError: Couldn't listen on 0.0.0.0:6073: [Errno 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted.

So what is going wrong here, and what is the best way to handle this situation? Please help...

P.S.: I widened the port range in the settings, but by default it always takes 6073.
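
This is how I changed the range in settings.py, for reference (TELNETCONSOLE_PORT takes a [min, max] range, and the telnet console binds the first free port it finds in it; [6023, 6073] is the documented default):

# settings.py -- widening the telnet console's port range
TELNETCONSOLE_PORT = [6023, 6173]  # default is [6023, 6073]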

2 answers:

Answer 0 (score: 7)

The easiest way is to disable the Telnet console by adding this to your settings.py:
EXTENSIONS = {
   'scrapy.telnet.TelnetConsole': None
}
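
Equivalently, the telnet extension has a dedicated on/off switch you can set in settings.py instead of editing EXTENSIONS:

# settings.py -- disables the telnet console without touching EXTENSIONS
TELNETCONSOLE_ENABLED = False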

Also see http://doc.scrapy.org/en/latest/topics/settings.html#extensions for the list of extensions enabled by default.

Answer 1 (score: 2)

Your problem could be solved by running fewer concurrent crawlers. Here is a recipe I wrote for issuing requests sequentially; this particular class runs only one crawler at a time, but the modifications needed to make it run them in batches (say, 10 at a time) are trivial:

# imports needed for this recipe (Scrapy 0.16-era API)
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor


class SequentialCrawlManager(object):
    """Start spiders sequentially"""

    def __init__(self, spider, websites):
        self.spider = spider
        self.websites = websites
        # setup crawler
        self.settings = get_project_settings()
        self.current_site_idx = 0

    def next_site(self):
        if self.current_site_idx < len(self.websites):
            self.crawler = Crawler(self.settings)
            # wait for one spider to finish before starting the next one
            self.crawler.signals.connect(self.next_site,
                                         signal=signals.spider_closed)
            # configure() must run before crawl()/start() on this API
            self.crawler.configure()
            spider = self.spider()  # pass per-site arguments here if desired
            self.crawler.crawl(spider)
            self.crawler.start()
            self.current_site_idx += 1
        else:
            reactor.stop()  # required for the program to terminate

    def start(self):
        log.start()
        self.next_site()
        reactor.run()  # blocking call
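
Used roughly like this (a sketch; MySpider and the URL list are placeholders for your own spider class and the rows you fetch from the database):

# hypothetical usage -- MySpider and the import path stand in for your own project
from myproject.spiders import MySpider

websites = ['http://example.com', 'http://example.org']
manager = SequentialCrawlManager(MySpider, websites)
manager.start()  # blocks until every site has been crawled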