Scrapy is not downloading images

Date: 2018-07-06 18:30:31

Tags: python image file download scrapy

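I'm trying to download some images with Scrapy. I followed the official documentation, copied and pasted some of the examples, and read many similar questions, but it still doesn't work. What am I missing?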

I noticed that the list of enabled item pipelines is empty, but I can't figure out why.

2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled item pipelines: []

I also tried different websites and custom headers, with no luck. The spider appears to run, but no files are saved.

Here is the code I used to test this.

myspider.py:

import scrapy


class ImageSpider(scrapy.Spider):
    name = "imagespider"

    start_urls = [
        "http://www.upv.es/",
    ]

    def parse(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(image_urls=[img_url]) # Not working
            #yield {'image_urls': [img_url]}  # Not working

items.py:

import scrapy


class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

settings.py:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = '/Users/salva/Desktop/demo/demo/temp'

Console:

2018-07-06 20:10:18 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-07-06 20:10:18 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 03:03:55) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.6.0-x86_64-i386-64bit
2018-07-06 20:10:18 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-07-06 20:10:18 [scrapy.core.engine] INFO: Spider opened
2018-07-06 20:10:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-06 20:10:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-06 20:10:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.upv.es/> (referer: None)
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/GRi.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/GRi.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/marcaUPVN1.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/img_identif.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/menu-hamburguesa2.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/menu-hamburguesa.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/espacio2.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar_GR.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-plegar_GR.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ico_nueva_ventana.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ico_nueva_ventana.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pcarrusel/slider_valentia_hyperloop2.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pcarrusel/slider_campus_109.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pcarrusel/slider_fsupv04_michigan.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_escuelas_fba_008.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_gente_campus_118.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_institutos_002.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icon_posgrado.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_alumnos_tecnologia_051.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_gente_campus_119.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ppromo/promo_vida_universitaria.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ppromo/promo_deportes3.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ppromo/promo_alojamiento.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ppromo/promo_valencia.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/mulet3-1.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/corma.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/andy.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/san_nicolas.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/formula.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/eco_sensor.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/icono_Riunet.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/icono_upvX.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/icono_poliConsulta.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/icono_poliAPPS.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-twitter.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-facebook.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-linkedin.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-instagram.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-youtube.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-google-plus.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/campus_excelencia-2WH.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/EMASupv-WH.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/xarxa_vives.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/universia_cl.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/forum_unesco_cl.png']}
2018-07-06 20:10:18 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-06 20:10:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 225,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 53981,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 7, 6, 18, 10, 18, 744230),
 'item_scraped_count': 56,
 'log_count/DEBUG': 58,
 'log_count/INFO': 7,
 'memusage/max': 103243776,
 'memusage/startup': 103239680,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 7, 6, 18, 10, 18, 355192)}
2018-07-06 20:10:18 [scrapy.core.engine] INFO: Spider closed (finished)

2 answers:

Answer 0 (score: 0)

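The spider scrapes the main page as instructed, but you are not joining each image's src with the page URL, so the items contain relative paths such as /imagenes/GRi.png. Try something like this (untested):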

def parse(self, response):
    for elem in response.xpath("//img"):
        img_url = elem.xpath("@src").extract_first()
        # Resolve the (possibly relative) src against the page URL
        yield ImageItem(image_urls=[response.urljoin(img_url)])
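To illustrate what joining the two links means, here is a minimal sketch using only the standard library's urllib.parse.urljoin. The helper name absolute_image_urls and the example.com URL are made up for illustration; the page URL and the first src value are taken from the log in the question.

```python
from urllib.parse import urljoin


def absolute_image_urls(page_url, srcs):
    """Resolve each img src against the page URL, skipping missing values."""
    return [urljoin(page_url, src) for src in srcs if src]


if __name__ == "__main__":
    # A relative src, a missing src, and an already absolute src
    srcs = ["/imagenes/GRi.png", None, "http://example.com/a.png"]
    for url in absolute_image_urls("http://www.upv.es/", srcs):
        print(url)
```

Relative paths are resolved against the page URL, while URLs that are already absolute are left unchanged, which is why this is safer than plain string concatenation.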

Answer 1 (score: 0)

It works when I run the spider from the terminal (scrapy crawl myspider), but it does not work when I run the spider from a script.

See https://github.com/scrapy/scrapy/issues/1904
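One likely explanation, offered as an assumption rather than a confirmed diagnosis: a spider started from a script through CrawlerProcess only sees the settings passed to it, so ITEM_PIPELINES and IMAGES_STORE from settings.py are never loaded. The log in the question seems consistent with this, since it shows the default bot name (scrapybot) and only USER_AGENT under "Overridden settings". The sketch below passes the image settings explicitly. The names IMAGE_SETTINGS and run are hypothetical; the values are copied from the question.

```python
# Settings the images pipeline needs, copied from the question's settings.py and log
IMAGE_SETTINGS = {
    "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
    "IMAGES_STORE": "/Users/salva/Desktop/demo/demo/temp",
    "USER_AGENT": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
}


def run(spider_cls, settings=None):
    """Run a spider from a script with the image settings applied."""
    # Imported here so the module can be loaded without Scrapy installed
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings=settings or IMAGE_SETTINGS)
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl has finished
```

Calling run(ImageSpider) should then list ImagesPipeline under "Enabled item pipelines". When the script lives inside the Scrapy project, an alternative is to pass get_project_settings() from scrapy.utils.project to CrawlerProcess instead of a hand-written dict. Note that ImagesPipeline also requires Pillow to be installed.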