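I recently started using Scrapy with Python, and so far it has been a disappointment. I was told it would be easy to point it at a website and download all the images on that site. The problem is that for the past 3 days all Scrapy has done is keep moving on to other URLs found on the page.
I don't understand why Scrapy is doing this. I never told it to, so what is going on? I just want to download all the images on the site.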
import scrapy
from PIL import Image
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from tumblr.items import TumblrItem
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector


class TumblrSpider(CrawlSpider):
    name = 'tumblr'
    allowed_domains = ['justanimegifs.tumblr.com', '78.media.tumblr.com']
    start_urls = ['https://justanimegifs.tumblr.com/']

    rules = (
        Rule(LinkExtractor(allow=(".",), deny=(["/post/*", "/image/*"])), callback='parse_item', follow=True),
    )

    def parse_tumblr(self, response):
        loader = XPathItemLoader(item=TumblrItem(), response=response)
        loader.add_xpath('image_urls', '//img/@src')
        return loader.load_item()
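Two details of the rule above can be checked without running Scrapy at all. The sketch below uses only the standard library. `is_followed`, `SpiderStub` and `resolve_callback` are hypothetical names made up for illustration, and the `/post/...` and `/postcards` URLs are invented examples. It assumes that `LinkExtractor` treats `allow` and `deny` as regular expressions searched anywhere in the URL (not shell globs), and that a string `callback` is looked up by name on the spider. The second point is an assumption about Scrapy's internals, so treat it as a hint rather than a diagnosis.

```python
import re

# Patterns copied from the rule in the spider above.
ALLOW = [r"."]
DENY = [r"/post/*", r"/image/*"]


def is_followed(url, allow=ALLOW, deny=DENY):
    """Approximate the allow/deny decision using regex search semantics."""
    allowed = any(re.search(p, url) for p in allow)
    denied = any(re.search(p, url) for p in deny)
    return allowed and not denied


class SpiderStub:
    """Hypothetical stand-in with the same method name as the spider above."""

    def parse_tumblr(self, response):
        return response


def resolve_callback(spider, name):
    """Look a callback up by name; return None when no such attribute exists."""
    return getattr(spider, name, None)


if __name__ == "__main__":
    for url in (
        "http://justanimegifs.tumblr.com/tags",
        "http://justanimegifs.tumblr.com/tagged/hyouka",
        "http://justanimegifs.tumblr.com/post/168803719954",  # invented post URL
        "http://justanimegifs.tumblr.com/postcards",  # invented, also hits "/post"
    ):
        print(url, is_followed(url))

    # The rule names 'parse_item', but the spider only defines 'parse_tumblr'.
    print(resolve_callback(SpiderStub(), "parse_item"))
    print(resolve_callback(SpiderStub(), "parse_tumblr") is not None)
```

Under these assumptions, `allow=(".",)` matches every URL, so with `follow=True` every on-site link that is not denied gets crawled, which is consistent with the `/tags` and `/tagged/...` requests in the log. The name lookup for `parse_item` finds nothing, which may be why pages are fetched but no items are produced.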
Here is some of the debugging output:
C:\Users\admin\Desktop\python-dev\web-crawler\tumblr>scrapy crawl tumblr
C:\Users\admin\Desktop\python-dev\web-crawler\tumblr\tumblr\spiders\tumblr_spider.py:4: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
from scrapy.contrib.spiders import Rule, CrawlSpider
C:\Users\admin\Desktop\python-dev\web-crawler\tumblr\tumblr\spiders\tumblr_spider.py:5: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
from scrapy.contrib.linkextractors import LinkExtractor
C:\Users\admin\Desktop\python-dev\web-crawler\tumblr\tumblr\spiders\tumblr_spider.py:7: ScrapyDeprecationWarning: Module `scrapy.contrib.loader` is deprecated, use `scrapy.loader` instead
from scrapy.contrib.loader import XPathItemLoader
2017-12-23 07:49:50 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: tumblr)
2017-12-23 07:49:50 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'tumblr', 'NEWSPIDER_MODULE': 'tumblr.spiders', 'SPIDER_MODULES': ['tumblr.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.0; rv:2.0) Gecko/20100101 Firefox/4.0'}
2017-12-23 07:49:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-12-23 07:49:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-23 07:49:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-23 07:49:50 [py.warnings] WARNING: C:\Users\admin\Miniconda3\lib\site-packages\scrapy\utils\deprecate.py:156: ScrapyDeprecationWarning: `scrapy.contrib.pipeline.images.ImagesPipeline` class is deprecated, use `scrapy.pipelines.images.ImagesPipeline` instead
ScrapyDeprecationWarning)
2017-12-23 07:49:50 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.images.ImagesPipeline']
2017-12-23 07:49:50 [scrapy.core.engine] INFO: Spider opened
2017-12-23 07:49:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-23 07:49:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-23 07:49:51 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://justanimegifs.tumblr.com/#_=_> from <GET https://justanimegifs.tumblr.com/>
2017-12-23 07:49:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/#_=_> (referer: None)
2017-12-23 07:49:51 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.tumblr.com': <GET https://www.tumblr.com/reblog/168803719954/s3QQXppA>
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/> (referer: http://justanimegifs.tumblr.com/)
2017-12-23 07:49:52 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://justanimegifs.tumblr.com/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/ask> (referer: http://justanimegifs.tumblr.com/)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/faq> (referer: http://justanimegifs.tumblr.com/)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tags> (referer: http://justanimegifs.tumblr.com/)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/itazura+na+kiss> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/katekyo+hitman+reborn> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/kaichou+wa+maid+sama> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/hunter+x+hunter> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/kimi+ni+todoke> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/howls+moving+castle> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/Hotarubi+no+mori+e> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/inuyasha> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/hotd> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/hyouka> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/guilty+crown> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/inu+x+boku+ss> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:53 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'wewereborntudie.tumblr.com': <GET http://wewereborntudie.tumblr.com/tagged/inuyasha>
2017-12-23 07:49:53 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'justa-lovesong.tumblr.com': <GET http://justa-lovesong.tumblr.com>
2017-12-23 07:49:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/eden+of+the+east> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/tsuritama> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/ours> (referer: http://justanimegifs.tumblr.com/tagged/itazura+na+kiss)
2017-12-23 07:49:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/khr> (referer: http://justanimegifs.tumblr.com/tagged/katekyo+hitman+reborn)
2017-12-23 07:49:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/uta+no+prince+sama> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/vampire+knight> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/leorio> (referer: http://justanimegifs.tumblr.com/tagged/hunter+x+hunter)
2017-12-23 07:49:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/toradora> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/sawako> (referer: http://justanimegifs.tumblr.com/tagged/kimi+ni+todoke)
2017-12-23 07:49:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/fairy+tail> (referer: http://justanimegifs.tumblr.com/tags)
2017-12-23 07:49:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://justanimegifs.tumblr.com/tagged/fullmetal+alchemist> (referer: http://justanimegifs.tumblr.com/tags)