It is configured automatically in Scrapy, but not in curl or plain requests.
In curl, we can do this without any proxy:
http://<server_ip>:8050/render.html?url=http://www.example.com/?timeout=10&wait=0.5
How can I use a proxy here?
I tried:
http://<server_ip>:8050/render.html?url=http://www.example.com/?timeout=10&wait=0.5 --proxy myproxy:port
But instead of the rendered page I got this "Access denied" response, served by a Lightspeed Systems web filter:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Lightspeed Systems - Web Access</title>
<style type="text/css">
html {
background: #13396b; /* Old browsers */
/* IE9 SVG, needs conditional override of 'filter' to 'none' */
background: url(data:image/svg+xml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiA/Pgo8c3ZnIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgd2lkdGg9IjEwMCUiIGhlaWdodD0iMTAwJSIgdmlld0JveD0iMCAwIDEgMSIgcHJlc2VydmVBc3BlY3RSYXRpbz0ibm9uZSI+CiAgPGxpbmVhckdyYWRpZW50IGlkPSJncmFkLXVjZ2ctZ2VuZXJhdGVkIiBncmFkaWVudFVuaXRzPSJ1c2VyU3BhY2VPblVzZSIgeDE9IjAlIiB5MT0iMCUiIHgyPSIwJSIgeTI9IjEwMCUiPgogICAgPHN0b3Agb2Zmc2V0PSIwJSIgc3RvcC1jb2xvcj0iIzEzMzk2YiIgc3RvcC1vcGFjaXR5PSIxIi8+CiAgICA8c3RvcCBvZmZzZXQ9IjEwMCUiIHN0b3AtY29sb3I9IiMzZTY1OTkiIHN0b3Atb3BhY2l0eT0iMSIvPgogIDwvbGluZWFyR3JhZGllbnQ+CiAgPHJlY3QgeD0iMCIgeT0iMCIgd2lkdGg9IjEiIGhlaWdodD0iMSIgZmlsbD0idXJsKCNncmFkLXVjZ2ctZ2VuZXJhdGVkKSIgLz4KPC9zdmc+);
background: -moz-linear-gradient(top, #13396b 0%, #3e6599 100%); /* FF3.6+ */
background: -webkit-gradient(linear, left top, left bottom, color-stop(0%,#13396b), color-stop(100%,#3e6599)); /* Chrome,Safari4+ */
background: -webkit-linear-gradient(top, #13396b 0%,#3e6599 100%); /* Chrome10+,Safari5.1+ */
background: -o-linear-gradient(top, #13396b 0%,#3e6599 100%); /* Opera 11.10+ */
background: -ms-linear-gradient(top, #13396b 0%,#3e6599 100%); /* IE10+ */
background: linear-gradient(to bottom, #13396b 0%,#3e6599 100%); /* W3C */
filter: progid:DXImageTransform.Microsoft.gradient( startColorstr='#13396b', endColorstr='#3e6599',GradientType=0 ); /* IE6-8 */
height: 100%;
}
body {
width: 960px;
overflow: hidden;
margin: 50px auto;
font-family: "HelveticaNeue-Light", "Helvetica Neue Light", "Helvetica Neue", Helvetica, Arial, "Lucida Grande", sans-serif;
font-size: 14px;
color: #a2c3ef;
}
h1,h2 {
color: #fff;
}
h1 {
font-size: 32px;
font-weight: normal;
}
h2 {
font-size: 24px;
font-weight: lighter;
}
a {
color: #fff;
font-weight: bold;
}
#content {
margin: 20px 0 20px 30px;
}
blockquote#error, blockquote#data {
color: #fff;
font-size: 16px;
}
#footer p {
font-size: 12px;
padding: 7px 12px;
margin-top: 10px;
color: #fff;
text-align: right;
}
</style>
<!--[if gte IE 9]>
<style type="text/css">
.gradient {
filter: none;
}
</style>
<![endif]-->
</head>
<body id=ERR_ACCESS_DENIED>
<div id="titles">
<h1>ERROR</h1>
<h2>Unable to complete URL request</h2>
</div>
<hr>
<div id="content">
<p>An error has occurred while trying to access <a href="http://<server_ip>:8050/render.html?">http://<server_ip>:8050/render.html?</a>.</p>
<blockquote id="error">
<p><b>Access denied.</b></p>
</blockquote>
<p>Security permissions are not allowing the request attempt. Please contact your service provider if you feel this is incorrect.</p>
</div>
<hr>
<div id="footer">
</div>
</body>
</html>
Quoting the whole URL and raising the timeout gives a 504 instead:
C:\Users\Dr. Printer>curl "http://<server_ip>:8050/render.html?url=http://www.example.com/?timeout=30&wait=0.5"
{"description": "Timeout exceeded rendering page", "type": "GlobalTimeoutError", "info": {"timeout": 30.0}, "error": 504}
Answer (score: 0)
Note that curl's --proxy option only proxies curl's own connection to the Splash server; it does not make Splash fetch the target page through a proxy. The proxy has to be configured inside Splash itself. If we want to use Crawlera as that proxy, we can do it with this Lua script, run through Splash's /execute endpoint:
function use_crawlera(splash)
    -- Make sure you pass your Crawlera API key in the 'crawlera_user' arg.
    -- Have a look at the file spiders/quotes-js.py to see how to do it.
    -- Find your Crawlera credentials in https://app.scrapinghub.com/
    local user = splash.args.crawlera_user
    local host = 'proxy.crawlera.com'
    local port = 8010
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        -- The filter rules below can speed up the crawl: they drop requests
        -- to undesired domains and useless resources. Adjust them to your
        -- use case and add your own rules.

        -- Discard requests to advertising and tracking domains.
        if string.find(request.url, 'doubleclick%.net') or
           string.find(request.url, 'analytics%.google%.com') then
            request.abort()
            return
        end

        -- Avoid using Crawlera for subresource fetches, to increase crawling
        -- speed. This example bypasses Crawlera for URLs that start with
        -- 'static.' or end with '.png'.
        if string.find(request.url, '://static%.') ~= nil or
           string.find(request.url, '%.png$') ~= nil then
            return
        end

        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=''}
    end)

    splash:on_response_headers(function (response)
        -- Reuse the session id that Crawlera returns in the response header.
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    -- Cookies are only passed in when the script is called from scrapy-splash.
    if splash.args.cookies then
        splash:init_cookies(splash.args.cookies)
    end
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
    })
    assert(splash:wait(splash.args.wait or 0.5))
    return {
        html = splash:html(),
        cookies = splash:get_cookies(),
    }
end
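To try this outside Scrapy, the script has to be sent to Splash's /execute endpoint. A minimal sketch with curl, assuming the script above is saved as crawlera.lua and <API_KEY> is a placeholder for your Crawlera key (with -G, curl URL-encodes each argument into the query string; for long scripts, a JSON POST body avoids URL length limits):

curl -G "http://<server_ip>:8050/execute" --data-urlencode "lua_source@crawlera.lua" --data-urlencode "url=http://www.example.com/" --data-urlencode "crawlera_user=<API_KEY>" --data-urlencode "wait=0.5"

For an ordinary proxy (the myproxy:port case from the question), newer Splash versions (2.1+) also accept a proxy argument directly on render.html, e.g. &proxy=http://myproxy:port, with no Lua script needed.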
Don't forget to install scrapy-crawlera and activate it in your settings. For more information, see https://support.scrapinghub.com/support/solutions/articles/22000188428-using-crawlera-with-splash-scrapy
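On the Scrapy side, the activation happens in settings.py. A minimal sketch, using the middleware path and setting names from the scrapy-crawlera docs; the API key is a placeholder:

# settings.py (sketch; <API_KEY> is a placeholder for your Crawlera key)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<API_KEY>'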