My problem is that I cannot embed my Splash script into my Scrapy spider. Splash itself runs fine and renders the content I need in the browser at http://localhost:8050, so I copied the script into the spider and tried to parse the rendered HTML with Scrapy. Here is my spider:
import scrapy
from scrapy_splash import SplashRequest

class Ntest(scrapy.Spider):
    name = "test"

    script = """
    function main(splash)
        splash.private_mode_enabled = false
        splash.html5_media_enabled = true
        assert(splash:go(args.url))
        assert(splash:wait(0.3))
        return {
            html = splash:html(),
            png = splash:png(),
            har = splash:har(),
        }
    end
    """

    def start_request(self, response):
        yield SplashRequest(
            url='https://www.mp4upload.com/embed-yfani9opk91x.html',
            endpoint='render.html',
            args={'lua_source': self.script},
            callback=self.parse,
        )

    def parse(self, response):
        r = response.css('body').extract()
Here is my settings.py:
SPLASH_URL = 'http://localhost:8050/'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
I run scrapy runspider .\main.py and I get:
2018-06-25 14:17:38 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-06-25 14:17:38 [scrapy.utils.log] INFO: Versions: lxml 4.2.2.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 16:07:46) [MSC v.1900 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-06-25 14:17:39 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_LOADER_WARN_ONLY': True}
2018-06-25 14:17:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-06-25 14:17:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-25 14:17:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-06-25 14:17:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-06-25 14:17:39 [scrapy.core.engine] INFO: Spider opened
2018-06-25 14:17:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-25 14:17:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-25 14:17:39 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-25 14:17:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 6, 25, 12, 17, 39, 112025),
'log_count/DEBUG': 1,
'log_count/INFO': 7,
'start_time': datetime.datetime(2018, 6, 25, 12, 17, 39, 104037)}
2018-06-25 14:17:39 [scrapy.core.engine] INFO: Spider closed (finished)
I need to extract the body from the HTML. Please help.
Answer (score: 0):
From the log it is clear that no requests are being executed.

If the code is indented exactly as in your post, start_request() and parse() are defined outside the Spider class. And even if they are not, the correct method name is start_requests().
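Both failure modes in that diagnosis can be reproduced without Scrapy at all. The sketch below (a hypothetical stand-in class, plain Python) shows why a mis-indented function never becomes a method of the class, and why a method not named exactly start_requests is invisible when the engine looks up the hook by that name:

```python
# Plain-Python sketch of the two bugs diagnosed above (no Scrapy required).

class BadSpider:
    name = "test"

    def start_request(self):        # typo: Scrapy only ever calls `start_requests`
        yield "SplashRequest(...)"

def parse(self, response):          # indented at module level: NOT a method of BadSpider
    return response

# The crawl engine looks up the hook by its exact name:
hook = getattr(BadSpider(), "start_requests", None)
print(hook is None)                 # True -> nothing to schedule

# The mis-indented parse() is a free function, not reachable on the class:
print(hasattr(BadSpider, "parse"))  # False
```

In a real spider the scrapy.Spider base class does provide a default start_requests(), but with an empty start_urls it yields nothing, which is consistent with the log above: the spider opens and immediately closes with zero pages crawled.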