Using Splash from Scrapinghub on a local machine

Date: 2019-07-13 22:57:56

Tags: python scrapy scrapy-splash scrapinghub

I got the instructions for Splash on Scrapinghub and I want to use it from a script that runs on my local machine. So far I have the setup described below.

One doubt up front: when I try to open the Splash server directly in a browser, I only get the output below and then a prompt asking for a username, and I can't see where this name is supposed to be set.

[24663:24680:0714/004126.068170:ERROR:browser_process_sub_thread.cc(221)] Waited 5 ms for network service
Opening in existing browser session

[screenshot of the browser asking for a username and password]

1) Edit the settings file:

# the Splash instance URL, copied from my Scrapinghub account
SPLASH_URL = 'http://xx.x0-splash.scrapinghub.com'


DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

I don't get an error, but I'm not sure Splash is actually doing anything either. Besides the server IP, Scrapinghub also provides a password, and I don't know where in the script it is supposed to be used.
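For reference, a minimal sketch of where such credentials are usually supplied, based on the commented http_user / http_pass attributes in the scrapy-splash example spider; newer scrapy-splash releases also document SPLASH_USER / SPLASH_PASS settings for the same purpose. The key value below is a placeholder, and treating the API key as the Basic-auth username with an empty password is an assumption:

# settings.py -- the Splash endpoint itself
SPLASH_URL = 'http://xx.x0-splash.scrapinghub.com'

# newer scrapy-splash versions (assumption: check the installed release) also accept:
# SPLASH_USER = '<splash-api-key>'
# SPLASH_PASS = ''

# in the spider, mirroring the scrapy-splash example spider:
import scrapy

class ListSpider(scrapy.Spider):
    name = 'list'
    # HTTP Basic auth for the Splash instance; assumption: the Scrapinghub
    # API key is the username and the password stays empty
    http_user = '<splash-api-key>'  # placeholder
    http_pass = ''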

After using SplashRequest and adding the API key, this is what I get (the full log is in the edit below); the content of the site is still not loaded.

2) The spider file:

import scrapy
import json
from scrapy import Request
from scrapy_splash import SplashRequest
import scrapy_splash

data_dir = 'data/'  # directory where fetched pages are written


class ListSpider(scrapy.Spider):

    name = 'list'
    allowed_domains = ['medium.com']  # domains only, not full URLs
    start_urls = ['https://medium.com/']

    def parse(self, response):
        print(response.body)
        # cookies exported from a logged-in browser session (currently unused)
        with open('data/cookies_file.json') as f:
            cookies_data = json.loads(f.read())[0]
        # print(cookies_data)
        url = 'https://medium.com/'
        # ask scrapy-splash to render the page via the 'splash' request meta
        # cookies=cookies_data,
        yield Request(url, callback=self.afterlogin,
                      meta={'splash': {'args': {'html': 1, 'png': 1}}})

    def afterlogin(self, response):
        # save the rendered page for later inspection
        with open(data_dir + 'after_login_page.html', 'w') as f:
            f.write(str(response.body))

EDIT:

This is the full log I get:

2019-07-17 10:10:08 [scrapy.core.engine] INFO: Spider opened
2019-07-17 10:10:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-17 10:10:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-07-17 10:10:09 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "www.meetmindful.com"; '*.meetmindful.com'!='www.meetmindful.com'
2019-07-17 10:10:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.meetmindful.com/> (referer: None)
2019-07-17 10:10:13 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
2019-07-17 10:10:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://app.meetmindful.com/login via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
2019-07-17 10:10:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-17 10:10:21 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
2019-07-17 10:10:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
2019-07-17 10:10:26 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-17 10:10:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'downloader/request_bytes': 2952,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 3,
 'downloader/response_bytes': 28104,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 7, 17, 14, 10, 26, 292646),
 'log_count/DEBUG': 5,
 'log_count/INFO': 8,
 'log_count/WARNING': 3,
 'memusage/max': 54104064,
 'memusage/startup': 54104064,
 'request_depth_max': 2,
 'response_received_count': 3,
 'retry/count': 1,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'splash/render.html/request_count': 2,
 'splash/render.html/response_count/200': 2,
 'start_time': datetime.datetime(2019, 7, 17, 14, 10, 8, 200073)}
2019-07-17 10:10:26 [scrapy.core.engine] INFO: Spider closed (finished)

1 Answer:

Answer 0 (score: 0):

If you look at their example file, they show how to use it:

https://github.com/scrapy-plugins/scrapy-splash/blob/e40ca4f3b367ab463273bee1357d3edfe0601f0d/example/scrashtest/spiders/quotes.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    # http_user = 'splash-user'
    # http_pass = 'splash-password'

    def parse(self, response):
        ...

Also, you need to yield SplashRequest instead of Request; as written, your code isn't really using Splash at all.

yield Request(url, callback=self.afterlogin, meta={'splash': {'args': {'html': 1, 'png': 1}}})

should be

yield SplashRequest(url, callback=self.afterlogin, endpoint='render.json', args={'html': 1, 'png': 1})
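Putting the two points together, a minimal sketch of how the spider from the question could look when it goes through SplashRequest; the wait argument, the credential handling, and the callback body are assumptions for illustration rather than part of the original post:

import scrapy
from scrapy_splash import SplashRequest


class ListSpider(scrapy.Spider):
    name = 'list'
    allowed_domains = ['medium.com']
    start_urls = ['https://medium.com/']

    # assumption: the Splash API key from Scrapinghub is the Basic-auth
    # username and the password is left empty
    http_user = '<splash-api-key>'
    http_pass = ''

    def start_requests(self):
        for url in self.start_urls:
            # render the page in Splash before it reaches the callback;
            # 'wait' gives the page's JavaScript a moment to run
            yield SplashRequest(url, callback=self.afterlogin, args={'wait': 2})

    def afterlogin(self, response):
        # response.text is the HTML produced by Splash's render.html endpoint
        with open('data/after_login_page.html', 'w') as f:
            f.write(response.text)

With the default render.html endpoint the response body is already the rendered HTML, so the html/png arguments from the question are only needed if the endpoint is switched to render.json.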