EDIT 2 - Because my folders got mixed up with the names I had chosen, I accidentally posted the wrong code. Below is the exact code for each file, taken from the correct folder that contains all my files.
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for pics project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'pics'
SPIDER_MODULES = ['pics.spiders']
NEWSPIDER_MODULE = 'pics.spiders'
IMAGES_STORE = 'W:/scrapy/scraped/'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'pics (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'pics.middlewares.MyCustomSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'pics.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'pics.pipelines.ImagesPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Pipeline.py
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.http import Request
from PIL import Image
class PicsPipeline(object):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(url=img_url, meta=meta)

    def process_item(self, item, spider):
        return item
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy.item import Item
class PicsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_name = scrapy.Field()
blogspot.py
import scrapy
from scrapy.selector import Selector, HtmlXPathSelector
from pics.items import PicsItem
from PIL import Image
class BlogspotSpider(scrapy.Spider):
    name = "blogspot"
    allowed_domains = ['blogspot.fr']
    start_urls = ["http://10rambo.blogspot.fr/"]

    def parse(self, response):
        LOG_FILE = "spider.log"
        for sel in response.xpath('/html'):
            item = PicsItem()
            for elem in response.xpath("//div[contains(@class, 'hentry')]"):
                item['image_name'] = elem.xpath("//h3[contains(@class, 'entry-title')]/a/text()").extract()
                url = elem.xpath("//div[contains(@class, 'entry-content')]/a/@href").extract_first()
                item['image_urls'] = url
                yield scrapy.Request(response.urljoin(url), callback=self.parse, dont_filter=True)
This is what the log currently says:
2017-06-15 21:20:00 [scrapy] INFO: Scrapy 1.2.1 started (bot: pics)
2017-06-15 21:20:00 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'pics.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pics.spiders'], 'LOG_FILE': 'log.log', 'BOT_NAME': 'pics'}
2017-06-15 21:20:00 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-06-15 21:20:00 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-15 21:20:00 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-15 21:20:00 [scrapy] INFO: Enabled item pipelines:
['pics.pipelines.ImagesPipeline']
2017-06-15 21:20:00 [scrapy] INFO: Spider opened
2017-06-15 21:20:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-15 21:20:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-15 21:20:00 [scrapy] DEBUG: Crawled (200) <GET http://10rambo.blogspot.fr/robots.txt> (referer: None)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (200) <GET http://10rambo.blogspot.fr/> (referer: None)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (404) <GET https://2.bp.blogspot.com/robots.txt> (referer: None)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (200) <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
2017-06-15 21:20:01 [scrapy] ERROR: Spider error processing <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
Traceback (most recent call last):
File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "W:\scrapy\pics\pics\spiders\blogspot.py", line 17, in parse
for sel in response.xpath('/html'):
AttributeError: 'Response' object has no attribute 'xpath'
[the same "Crawled (200)" DEBUG line and the identical AttributeError traceback repeat six more times for the same image URL]
2017-06-15 21:20:02 [scrapy] INFO: Closing spider (finished)
2017-06-15 21:20:02 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3247,
'downloader/request_count': 10,
'downloader/request_method_count/GET': 10,
'downloader/response_bytes': 2120514,
'downloader/response_count': 10,
'downloader/response_status_count/200': 9,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 6, 15, 19, 20, 2, 80000),
'log_count/DEBUG': 11,
'log_count/ERROR': 7,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 10,
'scheduler/dequeued': 8,
'scheduler/dequeued/memory': 8,
'scheduler/enqueued': 8,
'scheduler/enqueued/memory': 8,
'spider_exceptions/AttributeError': 7,
'start_time': datetime.datetime(2017, 6, 15, 19, 20, 0, 349000)}
2017-06-15 21:20:02 [scrapy] INFO: Spider closed (finished)
Answer 0 (score: 0)
This part of the log explains the error:
File "W:\path\to\mypics\spiders\blogspot.py", line 29, in parse
yield Request(response.urljoin(url), callback=self.parse)
NameError: global name 'Request' is not defined
It looks like you forgot to qualify the Request class with the scrapy module. However, that traceback does not correspond to the code you posted: the file names do not match (blogspot.py vs. spider.py), and in the parse method you do qualify Request with the scrapy module correctly. On the other hand, you missed it in get_media_requests in Pipeline.py:

yield Request(url=img_url, meta=meta)
Answer 1 (score: 0)
Spider error processing <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg>
That response is not HTML; it is a jpg file, so there is nothing to run XPath on. Scrapy wraps binary downloads in a plain Response, which has no .xpath method.