Scrapy Splash - staying logged in

Time: 2017-07-26 13:49:53

Tags: python session scrapy splash scrapy-splash

I'm using scrapy + splash to log in to a website (thanks to this thread).

I know I'm logged in, because I can display some elements that are only available once you are logged in. But when I try to reach another page with another request, the website asks me to log in again.

So it seems that scrapy (or splash) doesn't keep the session alive. Is there something that would let me stay logged in and keep the session alive?

Thanks,

1 answer:

Answer 0 (score: 1)

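Splash starts every render from a clean state, so if you want to keep a session you need to initialize the cookies first, and let Scrapy know about the cookies that were set during rendering. See the Session Handling section of the scrapy-splash README. A complete example might look like this (copy-pasted from the README):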

import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
    })
  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""

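class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder name; Scrapy requires one

    # ...

    def start_requests(self):
        # The README elides where the request is issued; start_requests
        # is one possible place, shown here so the snippet parses.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_result,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script},
            )

    def parse_result(self, response):
        # Here response.body contains the result HTML;
        # response.headers are filled with the headers from the last
        # web page loaded into Splash;
        # cookies from all responses and from JavaScript are collected
        # and put into the Set-Cookie response header, so that Scrapy
        # can remember them.
        pass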
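
To picture why both halves of the script matter (cookies passed in, cookies passed back out), here is a toy model in plain Python. It does not use Scrapy or Splash at all: `fake_render` and `SessionJar` are made-up names, and the login logic is invented purely for illustration.

```python
def fake_render(url, cookies):
    """Stand-in for one render: starts clean, sees only the cookies passed in."""
    browser_cookies = dict(cookies)  # plays the role of splash:init_cookies(...)
    if url.endswith("/login"):
        # Pretend the site sets a session cookie when the login page is hit.
        browser_cookies["sessionid"] = "abc123"
    logged_in = "sessionid" in browser_cookies
    # Returning the cookies plays the role of splash:get_cookies().
    return {"logged_in": logged_in, "cookies": browser_cookies}


class SessionJar:
    """Client-side bookkeeping: remembers cookies between renders."""

    def __init__(self):
        self.cookies = {}

    def fetch(self, url):
        result = fake_render(url, self.cookies)
        self.cookies.update(result["cookies"])  # merge the returned cookies back
        return result


jar = SessionJar()
jar.fetch("http://example.com/login")
with_session = jar.fetch("http://example.com/account")
without_session = fake_render("http://example.com/account", {})
print(with_session["logged_in"], without_session["logged_in"])  # True False
```

The second render only sees the login cookie because the jar sent it along. A render started with no cookies behaves like the re-login prompt described in the question.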

Note that session handling currently requires the /execute or /run endpoint; there are no helpers for the other endpoints.
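
The cookie helpers only work if the scrapy-splash middlewares are enabled in the project settings. A minimal sketch of the relevant part of settings.py, written from memory of the README's configuration section: the `SPLASH_URL` value assumes a local Splash instance, and the priorities should be checked against the README for your installed version.

```python
# settings.py (sketch)

# Assumption: Splash is running locally on its default port.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    # Sends stored cookies to Splash and merges returned cookies back.
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    # Needed for cache_args=['lua_source'] to avoid resending the script.
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Dupefilter that takes Splash arguments into account.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With this in place, and with a Lua script that accepts the cookies argument and returns a cookies field as in the example above, follow-up SplashRequest calls to the /execute endpoint should carry the same session.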