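How to use Splash (JS rendering service) with a proxy

Time: 2017-02-19 22:24:52

Tags: python curl proxy scrapy-splash splash-js-render

In Scrapy the proxy is configured automatically, but not with curl or plain requests.

With curl, we can do this without any proxy: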

http://<server_ip>:8050/render.html?url=http://www.example.com/?timeout=10&wait=0.5

How can I do the same through a proxy?

I tried:

http://<server_ip>:8050/render.html?url=http://www.example.com/?timeout=10&wait=0.5 --proxy myproxy:port

But I got:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>Lightspeed Systems - Web Access</title>

  <style type="text/css">
    html {
      background: #13396b; /* Old browsers */
      /* IE9 SVG, needs conditional override of 'filter' to 'none' */
      background: url();
      background: -moz-linear-gradient(top,  #13396b 0%, #3e6599 100%); /* FF3.6+ */
      background: -webkit-gradient(linear, left top, left bottom, color-stop(0%,#13396b), color-stop(100%,#3e6599)); /* Chrome,Safari4+ */
      background: -webkit-linear-gradient(top,  #13396b 0%,#3e6599 100%); /* Chrome10+,Safari5.1+ */
      background: -o-linear-gradient(top,  #13396b 0%,#3e6599 100%); /* Opera 11.10+ */
      background: -ms-linear-gradient(top,  #13396b 0%,#3e6599 100%); /* IE10+ */
      background: linear-gradient(to bottom,  #13396b 0%,#3e6599 100%); /* W3C */
      filter: progid:DXImageTransform.Microsoft.gradient( startColorstr='#13396b', endColorstr='#3e6599',GradientType=0 ); /* IE6-8 */
      height: 100%;
    }
    body {
      width: 960px;
      overflow: hidden;
      margin: 50px auto;
      font-family: "HelveticaNeue-Light", "Helvetica Neue Light", "Helvetica Neue", Helvetica, Arial, "Lucida Grande", sans-serif;
      font-size: 14px;
      color: #a2c3ef;
    }
    h1,h2 {
      color: #fff;
    }
    h1 {
      font-size: 32px;
      font-weight: normal;
    }
    h2 {
      font-size: 24px;
      font-weight: lighter;
    }
    a {
      color: #fff;
      font-weight: bold;
    }
    #content {
      margin: 20px 0 20px 30px;
    }
    blockquote#error, blockquote#data {
      color: #fff;
      font-size: 16px;
    }
    #footer p {
      font-size: 12px;
      padding: 7px 12px;
      margin-top: 10px;
      color: #fff;
      text-align: right;
    }
</style>

<!--[if gte IE 9]>
  <style type="text/css">
    .gradient {
      filter: none;
    }
  </style>
<![endif]-->
</head>

<body id=ERR_ACCESS_DENIED>
  <div id="titles">
    <h1>ERROR</h1>
    <h2>Unable to complete URL request</h2>
  </div>
  <hr>
  <div id="content">
    <p>An error has occurred while trying to access <a href="http://<server_ip>:8050/render.html?">http://<server_ip>:8050/render.html?</a>.</p>

    <blockquote id="error">
      <p><b>Access denied.</b></p>
    </blockquote>

    <p>Security permissions are not allowing the request attempt. Please contact your service provider if you feel this is incorrect.</p>
  </div>

  <hr>
  <div id="footer">
  </div>
</body>
</html>
C:\Users\Dr. Printer>curl "http://<server_ip>:8050/render.html?url=http://www.example.com/?timeout=30&wait=0.5"
{"description": "Timeout exceeded rendering page", "type": "GlobalTimeoutError", "info": {"timeout": 30.0}, "error": 504}
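A note on the attempts above, offered as a sketch rather than a confirmed diagnosis. curl's `--proxy` option routes curl's own connection to Splash through the proxy; it does not make Splash fetch the page through it. That would be consistent with the "Access denied" page being returned for the `render.html` address itself. Also, because the target URL is not percent-encoded, `?timeout=10` is parsed as part of the `url` value instead of as a Splash argument. Splash's HTTP API documents a `proxy` argument for `render.html` that takes a proxy profile name or a proxy URL, so one option is to build the request URL as below. The helper name `splash_render_url`, the Splash address, and the proxy address are placeholders assumed for illustration.

```python
from urllib.parse import urlencode


def splash_render_url(splash_base, target_url, proxy=None, timeout=10, wait=0.5):
    """Build a render.html URL with every argument percent-encoded."""
    params = {"url": target_url, "timeout": timeout, "wait": wait}
    if proxy is not None:
        # Splash's `proxy` argument: a proxy profile name or a proxy URL.
        params["proxy"] = proxy
    return f"{splash_base}/render.html?{urlencode(params)}"


# Placeholder addresses; substitute the real Splash server and proxy.
print(
    splash_render_url(
        "http://localhost:8050",
        "http://www.example.com/",
        proxy="http://myproxy:3128",
    )
)
```

The printed URL can be passed to curl in double quotes, with no `--proxy` option, so that curl talks to Splash directly and Splash talks to the proxy.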

1 Answer:

Answer 0: (score: 0)

If we want to use Crawlera as the proxy, we can do it with this Lua script:

function use_crawlera(splash)
    -- Make sure you pass your Crawlera API key in the 'crawlera_user' arg.
    -- Have a look at the file spiders/quotes-js.py to see how to do it.
    -- Find your Crawlera credentials in https://app.scrapinghub.com/
    local user = splash.args.crawlera_user

    local host = 'proxy.crawlera.com'
    local port = 8010
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        -- The commented code below can be used to speed up the crawling
        -- process. They filter requests to undesired domains and useless
        -- resources. Uncomment the ones that make sense to your use case
        -- and add your own rules.

        -- Discard requests to advertising and tracking domains.
        if string.find(request.url, 'doubleclick%.net') or
           string.find(request.url, 'analytics%.google%.com') then
           request:abort()
           return
        end

        -- Avoid using Crawlera for subresources fetching to increase crawling
        -- speed. The example below avoids using Crawlera for URLS starting
        -- with 'static.' and the ones ending with '.png'.
        if string.find(request.url, '://static%.') ~= nil or
           string.find(request.url, '%.png$') ~= nil then
           return
        end

        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=''}
    end)

    splash:on_response_headers(function (response)
        -- type() always returns a string, so compare the header value itself.
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:init_cookies(splash.args.cookies)
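    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
    })
    -- Assumption: the wait time is passed as a 'wait' argument. The original
    -- line held a Python format placeholder ({0}) at this point.
    assert(splash:wait(splash.args.wait or 0.5))
    return {
        html = splash:html(),
        cookies = splash:get_cookies(),
    }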
end
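Outside Scrapy, a script like this is sent to Splash's `/execute` endpoint, and any extra keys in the JSON body become available to the script as `splash.args`. The following is a minimal sketch of building such a request with the standard library only. The function name `build_execute_request`, the Splash address, the stand-in Lua source, and the API key placeholder are all assumptions for illustration. The script above also reads `cookies`, `headers`, and `http_method` from `splash.args`, so those may need to be supplied as well.

```python
import json
from urllib.request import Request


def build_execute_request(splash_base, lua_source, target_url, crawlera_user, wait=0.5):
    """Build a POST request for Splash's /execute endpoint."""
    payload = {
        "lua_source": lua_source,        # the Lua script text
        "url": target_url,               # read in Lua as splash.args.url
        "crawlera_user": crawlera_user,  # read as splash.args.crawlera_user
        "wait": wait,                    # available as splash.args.wait
    }
    return Request(
        f"{splash_base}/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; the one-line Lua source stands in for the full script.
request = build_execute_request(
    "http://localhost:8050",
    "function main(splash) return splash.args.url end",
    "http://www.example.com/",
    "<CRAWLERA_API_KEY>",
)
# urllib.request.urlopen(request) would send it; it is left out here so the
# sketch runs without a Splash server.
```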

Don't forget to install scrapy-crawlera and activate it in your settings. For more information, see https://support.scrapinghub.com/support/solutions/articles/22000188428-using-crawlera-with-splash-scrapy