Question

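I'm running into a small error with Scrapy and Pillow. I know there are many "similar" questions, but I have tried everything I found and nothing works.

I use Scrapy to parse many websites, more than 100,000 pages in total. I created a pipeline that checks whether a page contains an image; if it does, it downloads the image and creates a thumbnail at the same path. I do it this way so that if thumbnail creation fails, I still have the "big" version of the image.

Here is some code: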

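import os
import urllib  # Python 2: provides urllib.urlretrieve

from PIL import Image
from scrapy.exceptions import DropItem
from slugify import slugify

# `settings` is assumed to be the project's Scrapy settings object, defined elsewhere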

class DownloadImageOnDisk( object ):
    def process_item( self, item, spider ):
        try:
            # If image on page
            if item[ 'image' ]:
                img     = item[ 'image' ]
                # Get extension of image
                ext     = img.split( '.' )
                ext     = ext[ -1 ].split('?')
                ext     = ext[0]
                key     = self.remove_accents( item[ 'imagetitle' ] ).encode( 'utf-8', 'replace' )
                path    = settings[ 'IMG_PATH' ] + item[ 'website' ] + '/' + key + '.' + ext

                # Create dir
                if not os.path.exists( settings['IMG_PATH'] + item['website'] ):
                    os.makedirs( settings[ 'IMG_PATH' ] + item[ 'website' ] )

                # Check if image not already exist
                if not os.path.isfile( path ):
                    # Download big image
                    urllib.urlretrieve( img, path )
                    if os.path.isfile( path ):
                        # Create thumb
                        self.optimize_image( path )

                item[ 'image' ] = item[ 'website' ] + '/' + key + '.' + ext

            return item
        except Exception as exc:
            pass

    # Slugify path
    def remove_accents( self, input_str ):
        try:
            return slugify( input_str )
        except Exception as exc:
            raise DropItem( exc )

    # Create thumb
    def optimize_image( self, path ):
        try:
            image = Image.open( path )
            image.thumbnail( ( 200,200 ), Image.ANTIALIAS )
            image.save( path, optimize=True, quality=85 )
        except IOError  as exc:
            raise DropItem( exc )
        except Exception as exc:
            raise DropItem( exc )

But sometimes, not regularly (on about 100 items, I think), I get this error:

cannot identify image file '/PATH/NAME.jpg'

in the optimize_image function. When I check the disk, the image file is there.
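One possible explanation (an assumption; nothing in the post confirms it) is that the file on disk is not really an image, for example an HTML error page saved under a .jpg name. A minimal, stdlib-only sketch for checking this looks at the first bytes of the file. `sniff_image_type` is a hypothetical helper name, and it only recognizes JPEG, PNG and GIF signatures; it cannot detect a download that was cut off after a valid header.

```python
def sniff_image_type(path):
    """Return 'jpeg', 'png', 'gif', or None based on the file's first bytes."""
    signatures = (
        (b"\xff\xd8\xff", "jpeg"),       # JPEG start-of-image marker
        (b"\x89PNG\r\n\x1a\n", "png"),   # 8-byte PNG signature
        (b"GIF87a", "gif"),
        (b"GIF89a", "gif"),
    )
    with open(path, "rb") as handle:
        header = handle.read(8)
    for magic, name in signatures:
        if header.startswith(magic):
            return name
    return None
```

Running this on the files that trigger the error would show whether they carry a known image signature at all; if it returns None, printing the first few bytes of the file may reveal what the server actually sent.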

I really don't understand...

Any suggestions are welcome.

Thanks in advance.

Answer 1

Not sure, but this seems to have solved it:

import requests
import io
...
response = requests.get( img )
image = Image.open(io.BytesIO(response.content))
image.thumbnail( ( 200,200 ), Image.ANTIALIAS )
image.save( path, optimize=True, quality=85 )

I'll keep testing.
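The same idea can be wrapped in a small function that decodes the image from bytes held in memory and reports failure instead of raising. This is only a sketch under some assumptions: `save_thumbnail_from_bytes` is a hypothetical name, the target path is assumed to have an extension Pillow can write (such as .jpg), and `Image.LANCZOS` is used because `Image.ANTIALIAS` is no longer available in recent Pillow releases.

```python
import io

from PIL import Image


def save_thumbnail_from_bytes(data, path, size=(200, 200)):
    """Decode image bytes from memory and save a thumbnail to `path`.

    Returns True on success, False if Pillow cannot decode the bytes
    (for example a non-image response or a truncated download).
    """
    try:
        image = Image.open(io.BytesIO(data))
        image.load()  # force full decoding so truncated data fails here
    except OSError:
        return False
    image.thumbnail(size, Image.LANCZOS)  # resizes in place, keeps aspect ratio
    image.save(path, optimize=True, quality=85)
    return True
```

In the pipeline, `data` would be `requests.get( img ).content` as in the snippet above, and a False return value could be turned into a `DropItem` or a log entry, depending on how missing thumbnails should be handled.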

Pillow + Scrapy = sometimes cannot identify image file
