Scrapy: Error 10054 after retrying an image download

Time: 2016-03-07 19:52:53

Tags: python scrapy urllib

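I am running a Scrapy spider in Python to scrape images from a website. One of the images fails to download (even when I try to download it the normal way through the site); this is an internal error on the site's side. That is fine, I don't care about getting that image. I just want to skip an image when it fails and move on to the others, but I keep getting a 10054 error.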

> Traceback (most recent call last):
>   File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
>     current.result = callback(current.result, *args, **kw)
>   File "C:\Python27\Scripts\nhtsa\nhtsa\spiders\NHTSA_spider.py", line 137, in parse_photo_page
>     self.retrievePhoto(base_url_photo + url[0], url_text)
>   File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 49, in wrapped_f
>     return Retrying(*dargs, **dkw).call(f, *args, **kw)
>   File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 212, in call
>     raise attempt.get()
>   File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 247, in get
>     six.reraise(self.value[0], self.value[1], self.value[2])
>   File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 200, in call
>     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
>   File "C:\Python27\Scripts\nhtsa\nhtsa\spiders\NHTSA_spider.py", line 216, in retrievePhoto
>     code.write(f.read())
>   File "c:\python27\lib\socket.py", line 355, in read
>     data = self._sock.recv(rbufsize)
>   File "c:\python27\lib\httplib.py", line 612, in read
>     s = self.fp.read(amt)
>   File "c:\python27\lib\socket.py", line 384, in read
>     data = self._sock.recv(left)
> error: [Errno 10054] An existing connection was forcibly closed by the remote

Here is my parse function, which looks at the photo page and finds the important URLs:

def parse_photo_page(self, response):
    for sel in response.xpath('//table[@id="tblData"]/tr'):
        url = sel.xpath('td/font/a/@href').extract()
        table_fields = sel.xpath('td/font/text()').extract()
        if url:
            base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
            url_text = table_fields[3]
            url_text = string.replace(url_text, "&nbsp", "")
            url_text = string.replace(url_text, " ", "")
            self.retrievePhoto(base_url_photo + url[0], url_text)

Here is my download function, with the retry decorator:

from retrying import retry

@retry(stop_max_attempt_number=5, wait_fixed=2000)
def retrievePhoto(self, url, filename):
    fullPath = self.saveLocation + "/" + filename
    urllib.urlretrieve(url, fullPath)

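It retries the download 5 times, but then throws the 10054 error and does not continue to the next image. How can I get the spider to continue after retrying? Again, I don't care about downloading the problem image, I just want to skip it.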

1 Answer:

Answer 0: (score: 1)

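It is correct that you shouldn't use urllib in Scrapy, because it blocks everything. Try reading resources related to "scrapy twisted" and "scrapy asynchronous". Anyway... I don't believe your main problem is "continuing after retrying", but rather not using relative XPaths in your expressions. Here is a version that works for me (note the ./ in './td/font/a/@href'):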

import scrapy
import string
import urllib
import os

class MyspiderSpider(scrapy.Spider):
    name = "myspider"
    start_urls = (
        'file:index.html',
    )

    saveLocation = os.getcwd()

    def parse(self, response):
        for sel in response.xpath('//table[@id="tblData"]/tr'):
            url = sel.xpath('./td/font/a/@href').extract()
            table_fields = sel.xpath('./td/font/text()').extract()
            if url:
                base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
                url_text = table_fields[3]
                url_text = string.replace(url_text, "&nbsp","")
                url_text = string.replace(url_text," ","")
                self.retrievePhoto(base_url_photo + url[0], url_text)

    from retrying import retry
    @retry(stop_max_attempt_number=5, wait_fixed=2000)
    def retrievePhoto(self, url, filename): 
        fullPath = self.saveLocation + "/" + filename
        urllib.urlretrieve(url, fullPath)
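
On the "continue after retrying" part of the question: the traceback shows the retry wrapper re-raising the last error once its attempts run out (`raise attempt.get()`), and that exception then propagates out of the parse callback. A minimal sketch of one way to skip a failed image instead, using only the standard library; `retry_call` and `download_all` are hypothetical helper names, not part of Scrapy or the `retrying` package, and `fetch` stands for whatever download function is in use. Note that this still blocks while it waits, so it does not address the urllib concern above.

```python
import socket
import time


def retry_call(func, attempts=5, wait_seconds=2.0,
               exceptions=(IOError, socket.error)):
    """Call func() up to `attempts` times.

    Returns (True, result) on success, or (False, last_error) once all
    attempts have failed, instead of re-raising the error.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return True, func()
        except exceptions as exc:
            last_error = exc
            # Only wait if another attempt is coming.
            if attempt < attempts:
                time.sleep(wait_seconds)
    return False, last_error


def download_all(urls, fetch, attempts=5, wait_seconds=2.0):
    """Try every URL; record failures rather than letting one abort the loop."""
    saved, skipped = [], []
    for url in urls:
        ok, _ = retry_call(lambda: fetch(url), attempts, wait_seconds)
        (saved if ok else skipped).append(url)
    return saved, skipped
```

The same effect can be had in the spider by wrapping the `self.retrievePhoto(...)` call in a `try`/`except` and logging the failure, since the decorator only raises after the final attempt.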

Here is a (better) version that follows your pattern but uses the ImagesPipeline that @paul trmbrth mentioned:

import scrapy
import string
import os

class MyspiderSpider(scrapy.Spider):
    name = "myspider2"
    start_urls = (
        'file:index.html',
    )

    saveLocation = os.getcwd()

    custom_settings = {
        "ITEM_PIPELINES": {'scrapy.pipelines.images.ImagesPipeline': 1},
        "IMAGES_STORE": saveLocation
    }

    def parse(self, response):
        image_urls = []
        image_texts = []
        for sel in response.xpath('//table[@id="tblData"]/tr'):
            url = sel.xpath('./td/font/a/@href').extract()
            table_fields = sel.xpath('./td/font/text()').extract()
            if url:
                base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
                url_text = table_fields[3]
                url_text = string.replace(url_text, "&nbsp","")
                url_text = string.replace(url_text," ","")
                image_urls.append(base_url_photo + url[0])
                image_texts.append(url_text)

        return {"image_urls": image_urls, "image_texts": image_texts}

The demo file I used is:

$ cat index.html 
<table id="tblData"><tr>

<td><font>hi <a href="img/2015/cav.jpg"> foo </a> <span /> <span /> green.jpg     </font></td>

</tr><tr>

<td><font>hi <a href="img/2015/caw.jpg"> foo </a> <span /> <span /> blue.jpg     </font></td>

</tr></table>
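
The URL and filename handling shared by both spider versions can also be pulled into two small helpers, which makes it easy to check against the demo rows. This is only a sketch: `clean_filename` and `photo_target` are hypothetical names, and it uses `str.replace` and `urljoin` in place of the `string.replace` calls and string concatenation in the code above (`string.replace` exists only in Python 2).

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

BASE_URL_PHOTO = "http://www-nrd.nhtsa.dot.gov/"


def clean_filename(text):
    """Strip "&nbsp" fragments and spaces, as the spider's parse code does."""
    return text.replace("&nbsp", "").replace(" ", "")


def photo_target(href, label):
    """Return (absolute image URL, cleaned filename) for one table row."""
    return urljoin(BASE_URL_PHOTO, href), clean_filename(label)
```

For the first demo row, `photo_target("img/2015/cav.jpg", " green.jpg     ")` gives the absolute URL under `BASE_URL_PHOTO` and the filename `green.jpg`.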