Question

Good morning,

I get a connection error when running one of my spiders:

2014-02-28 10:21:00+0400 [butik] DEBUG: Retrying <GET http://www.butik.ru/> (failed 1 times): An error occurred while connecting: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.].

After that, the spider closes.

All my other spiders, which have a similar structure, run smoothly, but this one does not:

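# Imports as laid out in Scrapy 0.22 (assumed; the original post omitted them)
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request

class butik(Spider):
    name = "butik"
    allowed_domains = ['butik.ru']
    start_urls = ['http://www.butik.ru/']

    def parse(self, response):
        sel = Selector(response)
        print response.url
        # Collect the main category links from the site menu
        maincats = sel.xpath('//div[@id="main_menu"]//a/@href').extract()
        for maincat in maincats:
            # Prepend the site root to each href
            maincat = 'http://www.butik.ru' + maincat
            # self.categories is another callback of this spider (not shown in the post)
            request = Request(maincat, callback=self.categories)
            yield request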

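I am at a loss as to which steps to take to solve this, and I would appreciate any hints or answers. If more information is needed, I am happy to provide the necessary code.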

Thanks in advance

J

Answer 1

You can try using urllib2. I ran into a similar problem when crawling pages with Scrapy, and I solved it by using urllib2 inside parse:

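import urllib2

def parse(self, response):
    # ...
    # urllib2 needs a full URL, including the scheme
    url = 'http://www.example.com'
    # Assumption: data=None sends a GET; pass an encoded string to send a POST
    data = None
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    the_page = response.read()
    # ...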
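
The snippet above targets Python 2. On Python 3, urllib2 was split into urllib.request and urllib.error, so the same idea could be sketched roughly as follows. The helper name fetch and the 10-second timeout are illustrative assumptions, not part of the original answer.

```python
import urllib.error
import urllib.request


def fetch(url, data=None, timeout=10.0):
    """Return the response body as bytes, or None if the request fails.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    # data=None results in a GET; passing bytes results in a POST
    req = urllib.request.Request(url, data=data)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read()
    except OSError:
        # URLError and HTTPError are subclasses of OSError, so this also
        # covers refused, reset, and timed-out connections
        return None
```

Inside a spider, the_page = fetch('http://www.example.com') would take the place of the three urllib2 lines. Keep in mind that a blocking call like this inside a Scrapy callback stalls Twisted's event loop while it waits, so it is probably best treated as a diagnostic or a fallback rather than a replacement for Scrapy's own downloader.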

Scrapy 0.22: An error occurred while connecting: <class 'twisted.internet.error.ConnectionLost'>
