I've been researching this for a few days, and while I found many answers that come close to my problem, none of them actually address it, so I decided to go ahead and post the question. I'm using scrapy-splash to scrape KBB. By using send_text and send_keys I was able to get past the annoying first-visit popup, and this works perfectly in the browser version of Splash. It pulls in the dynamic content exactly the way I want. Great!
Here is the code, in an easy copy-paste form:
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  splash:send_text("24153")
  splash:send_keys("<Return>")
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
Now I'm trying to make this work in a spider script, because I want to be able to render multiple HTML files in one run. Here is the code I have so far; I only have two URLs to test with right now:
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "cars"
    start_urls = ["https://www.kbb.com/ford/escape/2017/titanium/", "https://www.kbb.com/honda/cr-v/2017/touring/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5, 'send_text': 24153, 'send_keys': '<Return>', 'wait': 5.0},
            )

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'car-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
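One thing worth checking in the args dict above: a Python dict literal with a repeated key silently keeps only the last value, so the 'wait': 0.5 entry never reaches Splash at all. (Also, as far as I can tell, send_text and send_keys are methods of the Lua splash object rather than render.html parameters, so they likely need to go through an 'execute' script.) A minimal sketch of the dict behavior:

```python
# Minimal sketch: a dict literal with a repeated key keeps only the last value.
# The 'wait': 0.5 entry below is silently replaced by 'wait': 5.0.
args = {'wait': 0.5, 'send_text': 24153, 'send_keys': '<Return>', 'wait': 5.0}

print(args['wait'])  # → 5.0
print(len(args))     # → 3 (not 4: only one 'wait' key survives)
```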
When I try to run it, it keeps telling me things are timing out:
2018-01-16 19:34:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:35:02 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://192.168.65.0:8050/robots.txt> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:35:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:36:17 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://192.168.65.0:8050/robots.txt> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:36:17 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://192.168.65.0:8050/robots.txt>: TCP connection timed out: 60: Operation timed out.
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2018-01-16 19:36:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:37:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:37:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.kbb.com/ford/escape/2017/titanium/ via http://192.168.65.0:8050/render.html> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:37:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.kbb.com/honda/cr-v/2017/touring/ via http://192.168.65.0:8050/render.html> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:38:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:38:48 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.kbb.com/ford/escape/2017/titanium/ via http://192.168.65.0:8050/render.html> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:38:48 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.kbb.com/honda/cr-v/2017/touring/ via http://192.168.65.0:8050/render.html> (failed 2 times): TCP connection timed out: 60: Operation timed out.
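Notably, the first failures in the log are Scrapy trying to fetch robots.txt from the Splash host itself (192.168.65.0:8050), which suggests it is the Splash URL that is unreachable rather than KBB. While debugging connectivity, one optional settings.py tweak (assuming you are comfortable skipping robots.txt during local testing) removes that extra round trip so the render requests fail or succeed on their own:

```python
# settings.py fragment: skip the robots.txt fetch while debugging Splash
# connectivity (re-enable once things work, if you want to honor robots.txt).
ROBOTSTXT_OBEY = False
```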
Here are my custom additions at the bottom of settings.py; I'm not sure you need the whole file, since most of it is commented out:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://localhost:8050/'
SPLASH_URL = 'http://192.168.65.0:8050'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
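Since settings.py is plain Python, of the two SPLASH_URL lines only the second assignment takes effect, which matches the log: every request goes through 192.168.65.0:8050, not localhost. A minimal sketch of the behavior:

```python
# settings.py is executed as ordinary Python: a later assignment to the same
# name simply replaces the earlier one, so the localhost value is discarded.
SPLASH_URL = 'http://localhost:8050/'
SPLASH_URL = 'http://192.168.65.0:8050'

print(SPLASH_URL)  # → http://192.168.65.0:8050
```

If Splash is actually running locally (for example in Docker with port 8050 published), deleting the second line so SPLASH_URL points at http://localhost:8050/ would stop requests being routed to the host that is timing out in the log.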
I have been following several tutorials trying to get this working. I assume it has something to do with the SPIDER_MIDDLEWARES, but I don't know what needs to change. I'm very new to spiders, so any help would be greatly appreciated.
Answer 0 (score: 0)
This took almost two weeks, but I finally got what I wanted. I had to switch to AutoBlog, because KBB didn't have everything I needed. The problem with AutoBlog is that it only loads the bottom of the page when you actually scroll down to it, so I used mouse_click to click a navigation button that scrolls the page down to the section I need. Then I waited a few seconds before rendering.
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "cars"
    start_urls = ["https://www.autoblog.com/buy/2017-Ford-Escape-SE__4dr_4x4/", "https://www.autoblog.com/buy/2017-Honda-CR_V-EX_L__4dr_Front_wheel_Drive/"]

    script = """
    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(10.0))
      splash:mouse_click(800, 335)
      assert(splash:wait(10.0))
      return {
        html = splash:html()
      }
    end
    """

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='execute',
                args={'lua_source': self.script, 'wait': 1.0},
            )

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'car-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
There are still some additions to make, and more URLs need to be added, but this is a functioning block of code!
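For adding more URLs, the two start_urls above appear to follow a year-make-model-trim__configuration pattern. A hypothetical helper like the following could generate them from vehicle fields; note the slug rules (hyphens and spaces becoming underscores) are inferred from only these two URLs, so they may not hold for every vehicle:

```python
def autoblog_url(year, make, model, trim, config):
    """Build an AutoBlog buy-page URL from vehicle fields.

    Hypothetical helper: the URL pattern and slug rules are inferred
    from the two start_urls in the spider above.
    """
    def slug(s):
        # AutoBlog appears to replace hyphens and spaces with underscores.
        return str(s).replace("-", "_").replace(" ", "_")

    return "https://www.autoblog.com/buy/%s-%s-%s-%s__%s/" % (
        year, slug(make), slug(model), slug(trim), slug(config))

print(autoblog_url(2017, "Ford", "Escape", "SE", "4dr 4x4"))
# → https://www.autoblog.com/buy/2017-Ford-Escape-SE__4dr_4x4/
```

A list of such tuples could then be mapped to start_urls, and since each generated URL ends in a trailing slash, the existing response.url.split("/")[-2] filename logic keeps working unchanged.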