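I have a Scrapy script that downloads images from a website. It works perfectly locally, and it also appears to run on the production server, but there the images are not saved even though I get no errors.
This is the output on the production server: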
2013-07-10 05:12:33+0200 [scrapy] INFO: Scrapy 0.16.5 started (bot: mybot)
2013-07-10 05:12:33+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-10 05:12:33+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-10 05:12:33+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-10 05:12:33+0200 [scrapy] DEBUG: Enabled item pipelines: CustomImagesPipeline
2013-07-10 05:12:33+0200 [bh] INFO: Spider opened
2013-07-10 05:12:33+0200 [bh] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-10 05:12:33+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-10 05:12:33+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-10 05:12:34+0200 [bh] DEBUG: Crawled (200) <GET http://www.mysite.com/find/brands.jsp> (referer: None)
2013-07-10 05:12:37+0200 [bh] DEBUG: Crawled (200) <GET http://www.mysite.com/c/browse/BrandName/ci/5732/N/4232860366> (referer: http://www.mysite.com/find/brands.jsp)
2013-07-10 05:12:41+0200 [bh] DEBUG: Crawled (200) <GET http://www.mysite.com/c/browse/Accessories-for-Camcorders/ci/5766/N/4232860347> (referer: http://www.mysite.com/c/browse/BrandName/ci/5732/N/4232860366)
2013-07-10 05:12:44+0200 [bh] DEBUG: Crawled (200) <GET http://www.mysite.com/c/buy/CategoryName/ci/5786/N/4232860316> (referer: http://www.mysite.com/c/browse/BrandName/ci/5732/N/4232860366)
2013-07-10 05:12:46+0200 [bh] DEBUG: Crawled (200) <GET http://www.mysite.com/images/images500x500/927001.jpg> (referer: None)
2013-07-10 05:12:46+0200 [bh] DEBUG: Image (downloaded): Downloaded image from <GET http://www.mysite.com/images/images500x500/927001.jpg> referred in <None>
2013-07-10 05:12:46+0200 [bh] DEBUG: Scraped from <200 http://www.mysite.com/c/buy/CategoryName/ci/5786/N/4232860316>
{'code': u'RFE234',
'image_urls': u'http://www.mysite.com/images/images500x500/927001.jpg',
'images': []}
2013-07-10 05:12:50+0200 [bh] DEBUG: Crawled (200) <GET http://www.mysite.com/images/images500x500/896290.jpg> (referer: None)
2013-07-10 05:12:50+0200 [bh] DEBUG: Image (downloaded): Downloaded image from <GET http://www.mysite.com/images/images500x500/896290.jpg> referred in <None>
2013-07-10 05:12:50+0200 [bh] DEBUG: Scraped from <200 http://www.mysite.com/c/buy/CategoryName/ci/5786/N/4232860316>
{'code': u'ABCD123',
'image_urls': u'http://www.mysite.com/images/images500x500/896290.jpg',
'images': []}
2013-07-10 05:13:18+0200 [bh] INFO: Closing spider (finished)
2013-07-10 05:13:18+0200 [bh] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 11107,
'downloader/request_count': 14,
'downloader/request_method_count/GET': 14,
'downloader/response_bytes': 527125,
'downloader/response_count': 14,
'downloader/response_status_count/200': 14,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 10, 3, 13, 18, 673536),
'image_count': 10,
'image_status_count/downloaded': 10,
'item_scraped_count': 10,
'log_count/DEBUG': 40,
'log_count/INFO': 4,
'request_depth_max': 2,
'response_received_count': 14,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2013, 7, 10, 3, 12, 33, 367609)}
2013-07-10 05:13:18+0200 [bh] INFO: Spider closed (finished)
The difference I noticed is that the 'images' field on my Item is an empty list [] on the server, whereas locally it usually looks like this:
2013-07-10 00:22:31-0300 [bh] DEBUG: Scraped from <200 http://www.mysite.com/c/buy/CategoryName/ci/5742/N/4232860364>
{'code': u'BGT453',
'image_urls': u'http://www.mysite.com/images/images500x500/834569.jpg',
'images': [{'checksum': 'ef2e2e42eeb06591bdfbdee568d29df1',
'path': u'bh/BGT453.jpg',
'url': 'http://www.mysite.com/images/images500x500/834569.jpg'}]}
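The main problem is that there are no errors in the output, so I don't know how to track the issue down.
PIL is up to date, and both machines run the same Scrapy version (0.16.5) and Python 2.7.
Update 1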
...
2013-07-10 06:48:50+0200 [scrapy] DEBUG: This is a DEBUG on CustomImagesPipeline !!
...
Update 2
I created CustomImagesPipeline so that images are saved with the product code as the file name. I copied the code from ImagesPipeline and made only a few changes.
from scrapy import log
from twisted.internet import defer, threads
from scrapy.http import Request
from cStringIO import StringIO
from PIL import Image
import time
from scrapy.contrib.pipeline.images import ImagesPipeline, ImageException
class CustomImagesPipeline(ImagesPipeline):
    def image_key(self, url, image_name):
        path = 'bh/%s.jpg' % image_name
        return path

    def get_media_requests(self, item, info):
        log.msg("This is a DEBUG on CustomImagesPipeline !! ", level=log.DEBUG)
        yield Request(item['image_urls'], meta=dict(image_name=item['code']))

    def get_images(self, response, request, info):
        key = self.image_key(request.url, request.meta.get('image_name'))
        orig_image = Image.open(StringIO(response.body))
        width, height = orig_image.size
        if width < self.MIN_WIDTH or height < self.MIN_HEIGHT:
            raise ImageException("Image too small (%dx%d < %dx%d)" % (width, height, self.MIN_WIDTH, self.MIN_HEIGHT))
        image, buf = self.convert_image(orig_image)
        yield key, image, buf
        for thumb_id, size in self.THUMBS.iteritems():
            thumb_key = self.thumb_key(request.url, thumb_id)
            thumb_image, thumb_buf = self.convert_image(image, size)
            yield thumb_key, thumb_image, thumb_buf

    def media_downloaded(self, response, request, info):
        referer = request.headers.get('Referer')
        if response.status != 200:
            log.msg(format='Image (code: %(status)s): Error downloading image from %(request)s referred in <%(referer)s>',
                    level=log.WARNING, spider=info.spider,
                    status=response.status, request=request, referer=referer)
            raise ImageException('download-error')
        if not response.body:
            log.msg(format='Image (empty-content): Empty image from %(request)s referred in <%(referer)s>: no-content',
                    level=log.WARNING, spider=info.spider,
                    request=request, referer=referer)
            raise ImageException('empty-content')
        status = 'cached' if 'cached' in response.flags else 'downloaded'
        log.msg(format='Image (%(status)s): Downloaded image from %(request)s referred in <%(referer)s>',
                level=log.DEBUG, spider=info.spider,
                status=status, request=request, referer=referer)
        self.inc_stats(info.spider, status)
        try:
            key = self.image_key(request.url, request.meta.get('image_name'))
            checksum = self.image_downloaded(response, request, info)
        except ImageException as exc:
            whyfmt = 'Image (error): Error processing image from %(request)s referred in <%(referer)s>: %(errormsg)s'
            log.msg(format=whyfmt, level=log.WARNING, spider=info.spider,
                    request=request, referer=referer, errormsg=str(exc))
            raise
        except Exception as exc:
            whyfmt = 'Image (unknown-error): Error processing image from %(request)s referred in <%(referer)s>'
            log.err(None, whyfmt % {'request': request, 'referer': referer}, spider=info.spider)
            raise ImageException(str(exc))
        return {'url': request.url, 'path': key, 'checksum': checksum}

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None force download
            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None force download
            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.EXPIRES:
                return  # returning None force download
            referer = request.headers.get('Referer')
            log.msg(format='Image (uptodate): Downloaded %(medianame)s from %(request)s referred in <%(referer)s>',
                    level=log.DEBUG, spider=info.spider,
                    medianame=self.MEDIA_NAME, request=request, referer=referer)
            self.inc_stats(info.spider, 'uptodate')
            checksum = result.get('checksum', None)
            return {'url': request.url, 'path': key, 'checksum': checksum}

        key = self.image_key(request.url, request.meta.get('image_name'))
        dfd = defer.maybeDeferred(self.store.stat_image, key, info)
        dfd.addCallbacks(_onsuccess, lambda _: None)
        dfd.addErrback(log.err, self.__class__.__name__ + '.store.stat_image')
        return dfd
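To make the naming scheme concrete, here is a minimal sketch in plain Python (no Scrapy needed) of how the storage key is derived from the request meta. The helper name build_image_key is hypothetical and only mirrors image_key for illustration. Note that a request without image_name in its meta would map to bh/None.jpg.

```python
def build_image_key(meta):
    """Mirror CustomImagesPipeline.image_key for a request's meta dict.

    `build_image_key` is a hypothetical helper used only for illustration.
    """
    image_name = meta.get('image_name')  # None when the key is missing
    return 'bh/%s.jpg' % image_name


print(build_image_key({'image_name': 'BGT453'}))  # bh/BGT453.jpg
print(build_image_key({}))                        # bh/None.jpg
```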
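The local system is Mac OS X; the production server is Debian GNU/Linux 7 (wheezy).
Answer 0 (score: 0)
From the docs:
The images in the list of the images field will retain the same order of the original image_urls field. If some image failed downloading, an error will be logged and the image won't be present in the images field.
It seems that logging must be explicitly enabled, for example: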
from scrapy import log
log.msg("This is a warning", level=log.WARNING)
So enable logging, and edit your question to include the errors you get.
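If the log still shows nothing, another option is to look at what the pipeline reports per item. This is a minimal sketch, assuming the results argument of ImagesPipeline.item_completed is a list of (success, value) tuples as the Scrapy docs describe. The helper name split_results is hypothetical and not part of Scrapy, and the failure value below is stand-in data.

```python
def split_results(results):
    """Separate (success, value) pairs into image infos and failures."""
    images = [value for ok, value in results if ok]
    failures = [value for ok, value in results if not ok]
    return images, failures


# Stand-in data: in a real pipeline the failure values would be
# Twisted Failure objects rather than strings.
results = [
    (True, {'url': 'http://www.mysite.com/images/images500x500/834569.jpg',
            'path': 'bh/BGT453.jpg',
            'checksum': 'ef2e2e42eeb06591bdfbdee568d29df1'}),
    (False, 'stand-in for a Failure'),
]
images, failures = split_results(results)
print("%d ok, %d failed" % (len(images), len(failures)))
```

Inside an overridden item_completed(self, results, item, info), each entry in failures could be passed to log.msg with level=log.WARNING before the item is returned, which should show why 'images' ends up empty.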
Answer 1 (score: -1)
We solved the problem by reinstalling these packages:
apt-get install python
apt-get install python-scrapy
apt-get install python-openssl
apt-get install python-pip
# Remove the packaged scrapy because it was probably corrupted
apt-get remove python-scrapy
apt-get install python-dev
pip install Scrapy
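After reinstalling, it may help to confirm which versions each machine actually imports, since a mismatch between the two environments is one possible cause. This is a minimal sketch, assuming each module exposes a __version__ attribute (older PIL releases may not, in which case 'unknown' is reported). The helper name report_versions is hypothetical.

```python
import importlib
import sys


def report_versions(module_names):
    """Map each module name to its version, 'unknown' or 'not installed'."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = 'not installed'
        else:
            # Not every module defines __version__, so fall back gracefully.
            versions[name] = str(getattr(module, '__version__', 'unknown'))
    return versions


print(sys.version)
print(report_versions(['scrapy', 'PIL', 'OpenSSL']))
```

Running the same script locally and on the server gives two outputs that can be compared line by line.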
Thanks!