Scrapy ImagesPipeline does not download images, but reports that it scraped them

时间:2018-06-14 13:41:09

标签: python scrapy

I am trying to download all the images used on a particular homepage. Scrapy reports that it scraped the images, but the files never end up in the directory specified by IMAGES_STORE. Am I missing something?

Spider:


class FakeSpider(scrapy.Spider):
    name = "fake"

    custom_settings = {
        'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
        'IMAGES_STORE': 'images'
    }

    def parse_page(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(image_urls=[img_url], images=[img_url])])

items.py:

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

pipelines.py:

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

1 answer:

Answer 0 (score: 0):

This needs to go in settings.py, doesn't it? Why is it in the spider class?

custom_settings = {
    'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
    'IMAGES_STORE': 'images'
}

settings.py

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'images'
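
If the relative path 'images' does not produce files where you expect, it can help to make IMAGES_STORE an absolute path. A minimal sketch, assuming the settings file sits inside the project package (adjust the path to wherever you want the files to land):

# settings.py -- sketch; the os.path layout below is an assumption
import os

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
# an absolute path removes any ambiguity about the working directory Scrapy runs in
IMAGES_STORE = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'images')

Note also that the built-in ImagesPipeline requires Pillow to be installed; without it the images will not be downloaded even though the pages are scraped.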

Update:

There is no need to pass images=[img_url] here; the ImagesPipeline fills the images field itself after it downloads the files listed in image_urls.

# wrong
yield ImageItem(image_urls=[img_url], images=[img_url])])

# how it should be
yield ImageItem(image_urls=[img_url])

# also, `img_url` needs to be the full URL of the image. For example,
# this is an image src:
# full/28c31bdd751cae30bbfdf641d82d2de0c64af653.jpg
# it needs to be the site URL joined with the image src; for example, this is a correct image URL:
# http://books.toscrape.com/media/cache/09/a3/09a3aef48557576e1a85ba7efea8ecb7.jpg

# change this so it works
def parse_page(self, response):
    for elem in response.xpath("//img"):
        img_url = elem.xpath("@src").extract_first()
        img_url = response.urljoin(img_url)  # resolve the relative src against the page URL
        yield ImageItem(image_urls=[img_url])
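
Putting the pieces together, a minimal self-contained spider might look like the sketch below. The start URL, the parse callback name, and the inline item definition are assumptions (books.toscrape.com is only used because it appears in the example above); ITEM_PIPELINES and IMAGES_STORE are assumed to be configured in settings.py as shown earlier.

import scrapy


class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()


class FakeSpider(scrapy.Spider):
    name = "fake"
    # assumed start page -- replace with the homepage you actually want to scrape
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        # Scrapy calls parse() for start_urls responses by default
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            if img_url:
                # urljoin turns a relative src into an absolute image URL
                yield ImageItem(image_urls=[response.urljoin(img_url)])

Run it with scrapy crawl fake; the downloaded files should then appear under IMAGES_STORE in a full/ subdirectory, named after the SHA1 hash of the image URL, as in the full/28c31bdd... example above.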