Splash + Scrapy, embedded script, scrapy extract() not working

Date: 2018-06-25 12:29:08

Tags: python scrapy scrapy-splash

My problem is that I can't embed a Splash script into my Scrapy crawler. Splash itself runs fine and renders the content I want in the browser at http://localhost:8050, so I copied the script over and tried to parse the rendered HTML with Scrapy. Here is my spider:

import scrapy

from scrapy_splash import SplashRequest

class Ntest(scrapy.Spider):
    name = "test"

    script = """
        function main(splash)
          splash.private_mode_enabled = false
          splash.html5_media_enabled = true
          assert(splash:go(args.url))
          assert(splash:wait(0.3))
          return {
                html = splash:html(),
                png = splash:png(),
                har = splash:har(),
                }
    end
        """



def start_request(self, response):
    yield SplashRequest(
        url = 'https://www.mp4upload.com/embed-yfani9opk91x.html',
        endpoint='render.html',
        args={'lua_source': self.script},
        callback=self.parse,
    )


def parse(self, response):
    r = response.css('body').extract()

Here is my settings.py:

SPLASH_URL = 'http://localhost:8050/'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

I run `scrapy runspider .\main.py`

and this is what I get:

2018-06-25 14:17:38 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-06-25 14:17:38 [scrapy.utils.log] INFO: Versions: lxml 4.2.2.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 16:07:46) [MSC v.1900 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-06-25 14:17:39 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_LOADER_WARN_ONLY': True}
2018-06-25 14:17:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-06-25 14:17:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-25 14:17:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-06-25 14:17:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-06-25 14:17:39 [scrapy.core.engine] INFO: Spider opened
2018-06-25 14:17:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-25 14:17:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-25 14:17:39 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-25 14:17:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 6, 25, 12, 17, 39, 112025),
 'log_count/DEBUG': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2018, 6, 25, 12, 17, 39, 104037)}
2018-06-25 14:17:39 [scrapy.core.engine] INFO: Spider closed (finished)

I just want to extract the body from the HTML, please help.

1 Answer:

Answer 0 (score: 0):

It is clear from the log that no requests are being executed at all.

If the code is indented the way it appears in your post, then start_request() and parse() are defined outside the spider class, so Scrapy never sees them. And even if they were indented correctly, the right method name is start_requests() (plural), which takes no argument besides self.