I got an instance of Splash on Scrapinghub and I want to use it from a script running on my local machine. So far I have the following:
1) Edited the settings file. One doubt here: when I try to open the Splash server in a browser it asks me for a username, and I can't see where that name is supposed to be set (see the note after the settings below). All I get on the console is:

[24663:24680:0714/004126.068170:ERROR:browser_process_sub_thread.cc(221)] Waited 5 ms for network service
Opening in existing browser session

The settings:
# I got this one from my Scraping Hub account
SPLASH_URL = 'http://xx.x0-splash.scrapinghub.com'
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
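Note on the username prompt: a hosted Splash instance is protected with HTTP Basic Auth, and as far as I can tell the Scrapinghub API key is used as the username with an empty password. A minimal sketch under that assumption, using the spider attributes that Scrapy's built-in HttpAuthMiddleware reads ('<YOUR_API_KEY>' is a placeholder, not a real key):

import scrapy

class ListSpider(scrapy.Spider):
    name = 'list'
    # ASSUMPTION: the API key is the Basic Auth username and the password is empty.
    http_user = '<YOUR_API_KEY>'  # consumed by Scrapy's HttpAuthMiddleware
    http_pass = ''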
I don't get an error, but I'm not sure Splash is doing anything at all. Besides the server IP, Scraping Hub also provides a password, and I don't know where in the script that is supposed to be used. After using SplashRequest and adding the API key the content of the site is still not loaded; the complete log I get is under EDIT below, and a sketch of passing the key per request follows the spider code.

2) The spider file:
import scrapy
import json
from scrapy import Request
from scrapy_splash import SplashRequest
import scrapy_splash

class ListSpider(scrapy.Spider):
    name = 'list'
    allowed_domains = ['https://medium.com/']
    start_urls = ['https://medium.com/']

    def parse(self, response):
        print(response.body)
        with open('data/cookies_file.json') as f:
            cookies_data = json.loads(f.read())[0]
        #print (cookies_data)
        url = 'https://medium.com/'
        #cookies=cookies_data,
        yield Request(url, callback=self.afterlogin, meta={'splash': {'args': {'html': 1, 'png': 1,}}})

    def afterlogin(self, response):
        with open(data_dir + 'after_login_page.html', 'w') as f:
            f.write(str(response.body))
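scrapy-splash also documents passing Basic Auth credentials per request through splash_headers; a minimal sketch of the parse method above rewritten that way, again assuming the API key is the username and the password is empty ('<YOUR_API_KEY>' is a placeholder):

from w3lib.http import basic_auth_header
from scrapy_splash import SplashRequest

def parse(self, response):
    # ASSUMPTION: API key as Basic Auth username, empty password.
    yield SplashRequest(
        'https://medium.com/',
        callback=self.afterlogin,
        args={'html': 1, 'png': 1},
        splash_headers={'Authorization': basic_auth_header('<YOUR_API_KEY>', '')},
    )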
EDIT:
This is the complete log I get:
2019-07-17 10:10:08 [scrapy.core.engine] INFO: Spider opened
2019-07-17 10:10:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-17 10:10:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-07-17 10:10:09 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "www.meetmindful.com"; '*.meetmindful.com'!='www.meetmindful.com'
2019-07-17 10:10:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.meetmindful.com/> (referer: None)
2019-07-17 10:10:13 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
2019-07-17 10:10:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://app.meetmindful.com/login via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
2019-07-17 10:10:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-17 10:10:21 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
2019-07-17 10:10:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
2019-07-17 10:10:26 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-17 10:10:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
'downloader/request_bytes': 2952,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 1,
'downloader/request_method_count/POST': 3,
'downloader/response_bytes': 28104,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 7, 17, 14, 10, 26, 292646),
'log_count/DEBUG': 5,
'log_count/INFO': 8,
'log_count/WARNING': 3,
'memusage/max': 54104064,
'memusage/startup': 54104064,
'request_depth_max': 2,
'response_received_count': 3,
'retry/count': 1,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'splash/render.html/request_count': 2,
'splash/render.html/response_count/200': 2,
'start_time': datetime.datetime(2019, 7, 17, 14, 10, 8, 200073)}
2019-07-17 10:10:26 [scrapy.core.engine] INFO: Spider closed (finished)
Answer 0 (score: 0)
If you look at their example file, they have shown how to use it:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    # http_user = 'splash-user'
    # http_pass = 'splash-password'

    def parse(self, response):
        ...
Also, you need to yield SplashRequest instead of Request; as written, your code is not using Splash at all.

yield Request(url, callback=self.afterlogin, meta={'splash': {'args': {'html': 1, 'png': 1,}}})

should be something like

yield SplashRequest(url, callback=self.afterlogin, args={'html': 1, 'png': 1})
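A side note on the render arguments: as far as I know render.html returns only the HTML, so asking for 'html': 1 and 'png': 1 together really belongs on the render.json endpoint, which returns a JSON object with the requested fields. A hedged sketch of requesting and reading both, based on scrapy-splash's JSON response handling (the callbacks mirror the question's spider):

import base64
from scrapy_splash import SplashRequest

def parse(self, response):
    yield SplashRequest(
        'https://medium.com/',
        callback=self.afterlogin,
        endpoint='render.json',      # JSON response with one key per requested field
        args={'html': 1, 'png': 1},
    )

def afterlogin(self, response):
    html = response.data['html']                   # rendered page source
    png = base64.b64decode(response.data['png'])   # screenshot bytes (base64 in the JSON)
    with open('after_login_page.html', 'w') as f:
        f.write(html)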