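I'm trying to download some images with Scrapy. I followed the official documentation, copied and pasted some of its examples, and read many similar questions, but I still can't get it to work. What am I missing?
I noticed that the list of item pipelines looks empty, but I can't figure out why: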
2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled item pipelines: []
I have also tried different websites and custom headers, with no luck. The spider appears to run, but no files are saved.
Here is the code I'm using to test this.
myspider.py:
import scrapy

# ImageItem is the item class defined in items.py


class ImageSpider(scrapy.Spider):
    name = "imagespider"
    start_urls = [
        "http://www.upv.es/",
    ]

    def parse(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(image_urls=[img_url])  # Not working
            # yield {'image_urls': [img_url]}  # Not working
items.py:
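import scrapy


class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()  # URLs the images pipeline should download
    images = scrapy.Field()  # filled in by the pipeline with download results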
settings.py:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/Users/salva/Desktop/demo/demo/temp'
Console:
2018-07-06 20:10:18 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-07-06 20:10:18 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 03:03:55) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.6.0-x86_64-i386-64bit
2018-07-06 20:10:18 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-07-06 20:10:18 [scrapy.core.engine] INFO: Spider opened
2018-07-06 20:10:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-06 20:10:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-06 20:10:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.upv.es/> (referer: None)
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/GRi.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/GRi.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/marcaUPVN1.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/img_identif.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/menu-hamburguesa2.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/menu-hamburguesa.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/espacio2.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar_GR.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-plegar_GR.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ico_nueva_ventana.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ico_nueva_ventana.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pcarrusel/slider_valentia_hyperloop2.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pcarrusel/slider_campus_109.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pcarrusel/slider_fsupv04_michigan.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_escuelas_fba_008.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_gente_campus_118.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_institutos_002.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icon_posgrado.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_alumnos_tecnologia_051.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_gente_campus_119.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ppromo/promo_vida_universitaria.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ppromo/promo_deportes3.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ppromo/promo_alojamiento.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ppromo/promo_valencia.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/mulet3-1.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/corma.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/andy.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/san_nicolas.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/formula.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/eco_sensor.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/icono_Riunet.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/icono_upvX.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/icono_poliConsulta.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/icono_poliAPPS.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-twitter.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-facebook.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-linkedin.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-instagram.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-youtube.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-google-plus.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/campus_excelencia-2WH.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/EMASupv-WH.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/xarxa_vives.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/universia_cl.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/forum_unesco_cl.png']}
2018-07-06 20:10:18 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-06 20:10:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 225,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 53981,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 7, 6, 18, 10, 18, 744230),
'item_scraped_count': 56,
'log_count/DEBUG': 58,
'log_count/INFO': 7,
'memusage/max': 103243776,
'memusage/startup': 103239680,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 7, 6, 18, 10, 18, 355192)}
2018-07-06 20:10:18 [scrapy.core.engine] INFO: Spider closed (finished)
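Answer 0 (score: 0)
It is scraping from the main link as instructed, but you are not joining the image source paths with the main link. The src values in your log (for example /imagenes/GRi.png) are relative, so they need to be turned into absolute URLs. Try something like this (untested):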
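def parse(self, response):
    for elem in response.xpath("//img"):
        img_url = elem.xpath("@src").extract_first()
        if img_url:  # skip <img> tags that have no src attribute
            # Resolve the relative src against the URL of the page
            yield ImageItem(image_urls=[response.urljoin(img_url)])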
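The effect of joining the page URL with a relative src can be checked outside Scrapy. The sketch below uses the standard library's urllib.parse.urljoin rather than Scrapy itself; the helper name to_absolute is made up for illustration, and the sample values are taken from the log in the question.

```python
from urllib.parse import urljoin


def to_absolute(page_url, src):
    """Resolve an <img> src value against the URL of the page it came from."""
    return urljoin(page_url, src)


page = "http://www.upv.es/"

# A root-relative path, as seen in the scraped items
print(to_absolute(page, "/imagenes/GRi.png"))

# Plain string concatenation would leave a double slash after the host
print(page + "/imagenes/GRi.png")

# An already absolute URL is returned unchanged
print(to_absolute(page, "http://example.com/logo.png"))
```

Inside a spider callback, response.urljoin(src) does the same kind of resolution against the response URL.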
Answer 1 (score: 0)
It works when I run the spider from the terminal (using scrapy crawl myspider), but it does not work when I run the spider from a script.
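The log in the question fits this observation: it reports the default bot name scrapybot, lists only USER_AGENT under "Overridden settings", and shows no item pipelines, which is what one would expect if the project's settings.py was never loaded. Below is a minimal sketch of one way to run the spider from a script with the pipeline enabled, assuming the script builds its own CrawlerProcess. The helper names build_settings and run_spider are hypothetical; the setting values are copied from the question.

```python
def build_settings(store_dir):
    """Return the settings the question keeps in settings.py as a plain dict."""
    return {
        "USER_AGENT": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
        "IMAGES_STORE": store_dir,
    }


def run_spider(spider_cls, store_dir):
    """Run spider_cls in-process with the images pipeline switched on."""
    # Imported here so build_settings() can be inspected without Scrapy installed.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings=build_settings(store_dir))
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl has finished
```

Calling run_spider(ImageSpider, '/Users/salva/Desktop/demo/demo/temp') from the script should then list ImagesPipeline under "Enabled item pipelines". If the script is started from inside the project directory, passing scrapy.utils.project.get_project_settings() to CrawlerProcess is another option, since that loads settings.py. ImagesPipeline also needs Pillow to be installed.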