Scrapy + Splash: 403 for any website

Date: 2017-12-19 21:02:35

Tags: python web-scraping scrapy splash

For some reason, every request returns a 403 when I use Splash. What am I doing wrong?

Following https://github.com/scrapy-plugins/scrapy-splash, I configured all the settings:

SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

I launched Splash with Docker:

sudo docker run -p 8050:8050 scrapinghub/splash
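Once the container is up, Splash serves an HTTP API on port 8050, and the render.html endpoint that scrapy-splash uses can be exercised directly in a browser or with curl. A minimal sketch of how that endpoint URL is composed (the target URL and wait value here just mirror the spider below; this is for sanity-checking the container, not part of the spider):

```python
from urllib.parse import urlencode

SPLASH_URL = "http://localhost:8050"  # assumes the default port mapping above

def splash_render_url(target, wait=0.5):
    """Build the render.html endpoint URL that Splash exposes for GET rendering."""
    query = urlencode({"url": target, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"

print(splash_render_url("https://www.vestiairecollective.com/men-clothing/jeans/"))
```

Opening the printed URL should return the rendered page if Splash itself is healthy, which helps separate Splash problems from target-site problems.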

Spider code:

import scrapy

from scrapy import Selector
from scrapy_splash import SplashRequest


class VestiaireSpider(scrapy.Spider):
    name = "vestiaire"
    base_url = "https://www.vestiairecollective.com"
    rotate_user_agent = True

    def start_requests(self):
        urls = ["https://www.vestiairecollective.com/men-clothing/jeans/"]
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse, meta={'args': {"wait": 0.5}})

    def parse(self, response):
        data = Selector(response)
        category_name = data.xpath('//h1[@class="campaign campaign-title clearfix"]/text()').extract_first().strip()
        self.log(category_name)
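As an aside, scrapy-splash documents that Splash arguments should be passed via the `args=` keyword of `SplashRequest` (equivalently under `meta['splash']['args']`); the bare `meta={'args': {...}}` in the spider above is therefore most likely never seen by the Splash middleware. A plain-Python sketch of the meta layout the middleware reads (the dict shapes are illustrative, simplified from what `SplashRequest` actually builds):

```python
# Shape produced by SplashRequest(url, callback, args={"wait": 0.5}):
correct_meta = {"splash": {"args": {"wait": 0.5}, "endpoint": "render.html"}}
# Shape from the spider above -- no 'splash' key, so the middleware skips it:
bare_meta = {"args": {"wait": 0.5}}

def splash_args(meta):
    """Return the Splash args the middleware would see, or None."""
    splash = meta.get("splash")
    return splash.get("args") if splash else None

print(splash_args(correct_meta))  # {'wait': 0.5}
print(splash_args(bare_meta))     # None
```

This does not explain the 403 by itself, but it means the `wait` was probably never applied.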

Then I run the spider:

scrapy crawl test

and get a 403 for the requested URL:

2017-12-19 22:55:17 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: crawlers)
2017-12-19 22:55:17 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'CONCURRENT_REQUESTS': 10, 'NEWSPIDER_MODULE': 'crawlers.spiders', 'SPIDER_MODULES': ['crawlers.spiders'], 'ROBOTSTXT_OBEY': True, 'COOKIES_ENABLED': False, 'BOT_NAME': 'crawlers', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage'}
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.corestats.CoreStats']
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.images.ImagesPipeline']
2017-12-19 22:55:17 [scrapy.core.engine] INFO: Spider opened
2017-12-19 22:55:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-19 22:55:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-19 22:55:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vestiairecollective.com/robots.txt> (referer: None)
2017-12-19 22:55:22 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://localhost:8050/robots.txt> (referer: None)
2017-12-19 22:55:23 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.vestiairecollective.com/men-clothing/jeans/ via http://localhost:8050/render.html> (referer: None)
2017-12-19 22:55:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.vestiairecollective.com/men-clothing/jeans/>: HTTP status code is not handled or not allowed
2017-12-19 22:55:23 [scrapy.core.engine] INFO: Closing spider (finished)
2017-12-19 22:55:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1254,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 2793,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/403': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 12, 19, 20, 55, 23, 440598),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/403': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 8,
 'memusage/max': 53850112,
 'memusage/startup': 53850112,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/render.html/request_count': 1,
 'splash/render.html/response_count/403': 1,
 'start_time': datetime.datetime(2017, 12, 19, 20, 55, 17, 372080)}
2017-12-19 22:55:23 [scrapy.core.engine] INFO: Spider closed (finished)

1 Answer:

Answer 0 (score: 1):

The problem is the User-Agent header. Many sites require one before they will serve content. The easiest way to access a site and avoid getting banned is to randomize the User-Agent with this lib: https://github.com/cnu/scrapy-random-useragent
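For reference, what such a middleware boils down to is picking a fresh User-Agent string for each outgoing request. A minimal sketch, independent of the library (the agent strings below are illustrative examples, not taken from scrapy-random-useragent):

```python
import random

# Illustrative pool of browser User-Agent strings to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0",
]

def random_user_agent():
    """Pick a User-Agent at random, as a rotation middleware would per request."""
    return random.choice(USER_AGENTS)

headers = {"User-Agent": random_user_agent()}
print(headers)
```

When rendering through Splash, make sure the chosen header actually reaches the request Splash makes to the target site (e.g. by setting it on the outgoing request headers), otherwise only the Scrapy-to-Splash hop carries it.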