从网页上刮取图像

时间:2018-05-09 10:58:07

标签: python image web-scraping scrapy

我试图使用Scrapy从网页下载图片,问题是页面有点复杂,所以我认为我没有正确定义xpath。 蜘蛛运行没有错误,但是,没有图像保存到我指定的文件夹中。我是scrapy的新手,所以我在这一点上陷入困​​境。网页在代码内...任何建议都会很受欢迎。谢谢!

import scrapy
import re
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.pipelines.images import ImagesPipeline
from imagetest.items import ImagetestItem
from imagetest.items import ImageItem

custom_settings = {
        "ITEM_PIPELINES": {'scrapy.pipelines.images.ImagesPipeline': 1},
        "IMAGES_STORE": '/Users/jose/Desktop'
    }

class imagetestSpider(scrapy.Spider):
   name = "imagetest"
   allowed_domains = ["dataviz.worldbank.org", "worldbank.org"]
   start_urls = ["http://dataviz.worldbank.org/t/DECDG/views/IFC4/DSHB_MCT02h?:embed=y&:display_count=no%22;%20break"]

   def parse(self, response):
    for sel in response.xpath("//html/body"):
        item = ImagetestItem()
            img_url = sel.xpath('//*[@idd="view2429655783888868492_8474849571036086192"]/div[1]/div[2]/img').extract()
            yield scrapy.Request(urlparse.urljoin(response.url, img_url),callback=self.parseImages,  meta={'item': item})

   def parseImages(self, response):
    for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(image_urls=[img_url])

0 个答案:

没有答案