我在python中运行Scrapy蜘蛛来从网站上抓取图像。其中一个图像无法下载(即使我尝试通过网站定期下载),这是该网站的内部错误。这很好,我不关心尝试获取图像,我只是想在图像失败时跳过图像并移动到其他图像上,但我不断收到10054错误。
> Traceback (most recent call last): File
> "c:\python27\lib\site-packages\twisted\internet\defer.py", line 588,
> in _runCallbacks
> current.result = callback(current.result, *args, **kw) File "C:\Python27\Scripts\nhtsa\nhtsa\spiders\NHTSA_spider.py", line 137,
> in parse_photo_page
> self.retrievePhoto(base_url_photo + url[0], url_text) File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 49, in wrapped_f
> return Retrying(*dargs, **dkw).call(f, *args, **kw) File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 212, in call
> raise attempt.get() File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 247, in get
> six.reraise(self.value[0], self.value[1], self.value[2]) File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 200, in call
> attempt = Attempt(fn(*args, **kwargs), attempt_number, False) File "C:\Python27\Scripts\nhtsa\nhtsa\spiders\NHTSA_spider.py", line
> 216, in retrievePhoto
> code.write(f.read()) File "c:\python27\lib\socket.py", line 355, in read
> data = self._sock.recv(rbufsize) File "c:\python27\lib\httplib.py", line 612, in read
> s = self.fp.read(amt) File "c:\python27\lib\socket.py", line 384, in read
> data = self._sock.recv(left) error: [Errno 10054] An existing connection was forcibly closed by the remote
这是我的解析功能,它会查看照片页面并找到重要的网址:
def parse_photo_page(self, response):
for sel in response.xpath('//table[@id="tblData"]/tr'):
url = sel.xpath('td/font/a/@href').extract()
table_fields = sel.xpath('td/font/text()').extract()
if url:
base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
url_text = table_fields[3]
url_text = string.replace(url_text, " ","")
url_text = string.replace(url_text," ","")
self.retrievePhoto(base_url_photo + url[0], url_text)
这是我的下载函数,带有重试装饰器:
from retrying import retry
@retry(stop_max_attempt_number=5, wait_fixed=2000)
def retrievePhoto(self, url, filename):
fullPath = self.saveLocation + "/" + filename
urllib.urlretrieve(url, fullPath)
它重试下载5次,但随后抛出10054错误并且不会继续下一个图像。重试后如何让蜘蛛继续?再一次,我不在乎下载问题图像,我只是想跳过它。
答案 0 :(得分:1)
在scrapy中你不应该使用urllib
是正确的,因为它阻止了一切。尝试阅读与" scrapy twisted"相关的资源。和" scrapy异步"。无论如何......我不相信你的主要问题是"重试后继续"但没有使用"相关的xpaths"在你的表达。这是一个适合我的版本(请注意./
中的'./td/font/a/@href'
):
import scrapy
import string
import urllib
import os
class MyspiderSpider(scrapy.Spider):
name = "myspider"
start_urls = (
'file:index.html',
)
saveLocation = os.getcwd()
def parse(self, response):
for sel in response.xpath('//table[@id="tblData"]/tr'):
url = sel.xpath('./td/font/a/@href').extract()
table_fields = sel.xpath('./td/font/text()').extract()
if url:
base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
url_text = table_fields[3]
url_text = string.replace(url_text, " ","")
url_text = string.replace(url_text," ","")
self.retrievePhoto(base_url_photo + url[0], url_text)
from retrying import retry
@retry(stop_max_attempt_number=5, wait_fixed=2000)
def retrievePhoto(self, url, filename):
fullPath = self.saveLocation + "/" + filename
urllib.urlretrieve(url, fullPath)
这是一个(更好)版本,它遵循你的模式,但使用@paul trmbrth提到的ImagesPipeline
。
import scrapy
import string
import os
class MyspiderSpider(scrapy.Spider):
name = "myspider2"
start_urls = (
'file:index.html',
)
saveLocation = os.getcwd()
custom_settings = {
"ITEM_PIPELINES": {'scrapy.pipelines.images.ImagesPipeline': 1},
"IMAGES_STORE": saveLocation
}
def parse(self, response):
image_urls = []
image_texts = []
for sel in response.xpath('//table[@id="tblData"]/tr'):
url = sel.xpath('./td/font/a/@href').extract()
table_fields = sel.xpath('./td/font/text()').extract()
if url:
base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
url_text = table_fields[3]
url_text = string.replace(url_text, " ","")
url_text = string.replace(url_text," ","")
image_urls.append(base_url_photo + url[0])
image_texts.append(url_text)
return {"image_urls": image_urls, "image_texts": image_texts}
我使用的演示文件是:
$ cat index.html
<table id="tblData"><tr>
<td><font>hi <a href="img/2015/cav.jpg"> foo </a> <span /> <span /> green.jpg </font></td>
</tr><tr>
<td><font>hi <a href="img/2015/caw.jpg"> foo </a> <span /> <span /> blue.jpg </font></td>
</tr></table>