I am trying to download all of the images used on a particular homepage. Scrapy reports that it scraped the images, but the files never end up in the specified IMAGES_STORE directory. Am I missing something?

Spider:
import scrapy

class FakeSpider(scrapy.Spider):
    name = "fake"
    custom_settings = {
        'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
        'IMAGES_STORE': 'images'
    }

    def parse_page(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(image_urls=[img_url], images=[img_url])
items.py:
class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
pipelines.py:
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
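For what it's worth, `item_completed` receives `results` as a list of `(success, info)` 2-tuples, so the path-collecting comprehension above is easy to check outside Scrapy. The sample data below is made up for illustration:

```python
# Simulated `results` as ImagesPipeline passes them to item_completed():
# a list of (success, info) 2-tuples. On success, info is a dict holding
# 'url', 'path' (relative to IMAGES_STORE) and 'checksum'; on failure it
# carries the error (a Twisted Failure in a real crawl; stubbed here).
results = [
    (True, {'url': 'http://example.com/a.jpg', 'path': 'full/a.jpg', 'checksum': 'abc123'}),
    (False, Exception('download failed')),
]

# The same comprehension used in item_completed():
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)  # ['full/a.jpg']
```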
Answer 0 (score: 0):

These need to go in settings.py, no? Why are they in the spider class:
custom_settings = {
    'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
    'IMAGES_STORE': 'images'
}
In settings.py:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'images'
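One more possible gotcha: a relative IMAGES_STORE such as 'images' resolves against whatever directory scrapy crawl is launched from, which is a common reason downloaded files seem to vanish. A sketch of settings.py using an absolute path (the layout here is an assumption, not taken from the question):

```python
import os

# settings.py is an ordinary Python module, so these are plain assignments.
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

# Anchoring the store to an absolute path removes any dependence on the
# working directory that `scrapy crawl` happens to be started from.
IMAGES_STORE = os.path.abspath('images')
```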
Update:
There is no need to pass images=[img_url] here:
# wrong
yield ImageItem(image_urls=[img_url], images=[img_url])

# correct
yield ImageItem(image_urls=[img_url])

# `img_url` also needs to be the full URL of the image. For example, this
# image src:
#     full/28c31bdd751cae30bbfdf641d82d2de0c64af653.jpg
# must be joined with the site's URL; a correct image URL looks like:
#     http://books.toscrape.com/media/cache/09/a3/09a3aef48557576e1a85ba7efea8ecb7.jpg
# change parse_page so it builds an absolute URL:
def parse_page(self, response):
    for elem in response.xpath("//img"):
        img_url = elem.xpath("@src").extract_first()
        img_url = response.urljoin(img_url)  # resolve src against the page URL
        yield ImageItem(image_urls=[img_url])
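The join can be sanity-checked outside Scrapy: Response.urljoin defers to the standard library's urljoin, so the behaviour is easy to try on its own. Both values below are made-up examples, not taken from the crawl:

```python
from urllib.parse import urljoin

# hypothetical page URL and a relative <img src> found on it
page_url = "http://books.toscrape.com/catalogue/page-1.html"
src = "../media/cache/09/a3/09a3aef48557576e1a85ba7efea8ecb7.jpg"

# urljoin resolves the relative src against the page URL; naive string
# concatenation (page_url + src) would produce a broken address instead.
print(urljoin(page_url, src))
# http://books.toscrape.com/media/cache/09/a3/09a3aef48557576e1a85ba7efea8ecb7.jpg
```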