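I'm not very familiar with Python in general (Selenium, Scrapy, etc.) or with web scraping, but I'm quite comfortable with other languages such as Java, so please forgive me if I'm missing something very simple!
My end goal is to visit a page, stay on it for about 10 seconds, then close the browser and repeat. On top of that, I'm trying to rotate my IP address by sending each request through a different proxy. Visiting the page works fine on its own, but as soon as I add the rotating proxies I get a long connection error that I can't make sense of, and it appears to contain a lot of CSS.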
Full code snippet
The problem seems to be caused by the second line of the try block, where the driver tries to access the site.
import time  # required by time.sleep() below; missing from the original imports
import scrapy
import requests
from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy, ProxyType
from scrapy.http import Request
from lxml.html import fromstring
from itertools import cycle


class VisitPageSpider(scrapy.Spider):
    name = 'visitpage'
    allowed_domains = ['books.toscrape.com']

    def start_requests(self):
        test_url = 'http://books.toscrape.com'
        proxies = self.get_proxies()
        proxy_pool = cycle(proxies)
        prox = Proxy()
        prox.proxy_type = ProxyType.MANUAL
        view_count = 0
        url = 'https://httpbin.org/ip'
        for i in range(1, 11):
            # Take the next "host:port" entry from the rotating pool.
            proxy = next(proxy_pool)
            prox.http_proxy = proxy
            prox.socks_proxy = proxy
            prox.ssl_proxy = proxy
            # Copy the dict so the shared DesiredCapabilities constant is not mutated.
            capabilities = webdriver.DesiredCapabilities.INTERNETEXPLORER.copy()
            prox.add_to_capabilities(capabilities)
            print("Request #%d" % i)
            try:
                self.driver = webdriver.Ie(desired_capabilities=capabilities)
                self.driver.get(test_url)
                view_count += 1
                time.sleep(10)
                self.driver.quit()
            except Exception as exc:
                # Show the real exception; the original bare except also hid the
                # NameError from the missing "import time", so quit() never ran.
                print("Skipping. Connection error: %r" % exc)
        # view_count is an int, so format it rather than concatenating with +.
        print('Total New Views %d' % view_count)
        yield Request(test_url, callback=self.visit_page)

    def visit_page(self, response):
        pass

    def get_proxies(self):
        url = 'https://free-proxy-list.net/'
        response = requests.get(url)
        parser = fromstring(response.text)
        proxies = set()
        # Inspect the first 10 table rows and keep those whose 7th column says "yes".
        for i in parser.xpath('//tbody/tr')[:10]:
            if i.xpath('.//td[7][contains(text(),"yes")]'):
                # Build "ip:port" from the first two columns.
                proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
                proxies.add(proxy)
        print(proxies)
        return proxies
CMD output
For the first two lines of the try block, respectively:
2018-07-26 18:19:21 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:52898/session {"capabilities": {"firstMatch": [{}], "alwaysMatch": {"browserName": "internet explorer", "platformName": "windows", "proxy": {"proxyType": "manual", "httpProxy": "46.227.162.167:8080", "sslProxy": "46.227.162.167:8080", "socksProxy": "46.227.162.167:8080"}}}, "desiredCapabilities": {"browserName": "internet explorer", "version": "", "platform": "WINDOWS", "proxy": {"proxyType": "MANUAL", "httpProxy": "46.227.162.167:8080", "sslProxy": "46.227.162.167:8080", "socksProxy": "46.227.162.167:8080"}}}
2018-07-26 18:19:21 [selenium.webdriver.remote.remote_connection] DEBUG: b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html><head>\n<meta type="copyright" content="Copyright (C) 1996-2015 The Squid Software Foundation and contributors">\n<meta http-equiv="Content-Type" CONTENT="text/html; charset=utf-8">\n<title>ERROR: The requested URL could not be retrieved</title>\n<style type="text/css"><!-- \n /*\n * Copyright (C) 1996-2016 The Squid Software Foundation and contributors\n *\n * Squid software is distributed under GPLv2+ license and includes\n * contributions from numerous individuals and organizations.\n * Please see the COPYING and CONTRIBUTORS files for details.\n */\n\n/*\n Stylesheet for Squid Error pages\n Adapted from design by Free CSS Templates\n http://www.freecsstemplates.org\n Released for free under a Creative Commons Attribution 2.5 License\n*/\n\n/* Page basics */\n* {\n\tfont-family: verdana, sans-serif;\n}\n\nhtml body {\n\tmargin: 0;\n\tpadding: 0;\n\tbackground: #efefef;\n\tfont-size: 12px;\n\tcolor: #1e1e1e;\n}\n\n/* Page displayed title area */\n#titles {\n\tmargin-left: 15px;\n\tpadding: 10px;\n\tpadding-left: 100px;\n\tbackground: url(\'/squid-internal-static/icons/SN.png\') no-repeat left;\n}\n\n/* initial title */\n#titles h1 {\n\tcolor: #000000;\n}\n#titles h2 {\n\tcolor: #000000;\n}\n\n/* special event: FTP success page titles */\n#titles ftpsuccess {\n\tbackground-color:#00ff00;\n\twidth:100%;\n}\n\n/* Page displayed body content area */\n#content {\n\tpadding: 10px;\n\tbackground: #ffffff;\n}\n\n/* General text */\np {\n}\n\n/* error brief description */\n#error p {\n}\n\n/* some data which may have caused the problem */\n#data {\n}\n\n/* the error message received from the system or other software */\n#sysmsg {\n}\n\npre {\n font-family:sans-serif;\n}\n\n/* special event: FTP / Gopher directory listing */\n#dirmsg {\n font-family: courier;\n color: black;\n font-size: 
10pt;\n}\n#dirlisting {\n margin-left: 2%;\n margin-right: 2%;\n}\n#dirlisting tr.entry td.icon,td.filename,td.size,td.date {\n border-bottom: groove;\n}\n#dirlisting td.size {\n width: 50px;\n text-align: right;\n padding-right: 5px;\n}\n\n/* horizontal lines */\nhr {\n\tmargin: 0;\n}\n\n/* page displayed footer area */\n#footer {\n\tfont-size: 9px;\n\tpadding-left: 10px;\n}\n\n\nbody\n:lang(fa) { direction: rtl; font-size: 100%; font-family: Tahoma, Roya, sans-serif; float: right; }\n:lang(he) { direction: rtl; }\n --></style>\n</head><body id=ERR_CONNECT_FAIL>\n<div id="titles">\n<h1>ERROR</h1>\n<h2>The requested URL could not be retrieved</h2>\n</div>\n<hr>\n\n<div id="content">\n<p>The following error was encountered while trying to retrieve the URL: <a href="http://127.0.0.1:52898/session">http://127.0.0.1:52898/session</a></p>\n\n<blockquote id="error">\n<p><b>Connection to 127.0.0.1 failed.</b></p>\n</blockquote>\n\n<p id="sysmsg">The system returned: <i>(111) Connection refused</i></p>\n\n<p>The remote host or network may be down. Please try the request again.</p>\n\n<p>Your cache administrator is <a href="mailto:webmaster?subject=CacheErrorInfo%20-%20ERR_CONNECT_FAIL&body=CacheHost%3A%20vps188962%0D%0AErrPage%3A%20ERR_CONNECT_FAIL%0D%0AErr%3A%20(111)%20Connection%20refused%0D%0ATimeStamp%3A%20Fri,%2027%20Jul%202018%2004%3A19%3A20%20GMT%0D%0A%0D%0AClientIP%3A%2072.234.175.171%0D%0AServerIP%3A%20127.0.0.1%0D%0A%0D%0AHTTP%20Request%3A%0D%0APOST%20%2Fsession%20HTTP%2F1.1%0AAccept-Encoding%3A%20identity%0D%0AContent-Length%3A%20501%0D%0AAccept%3A%20application%2Fjson%0D%0AContent-Type%3A%20application%2Fjson%3Bcharset%3DUTF-8%0D%0AUser-Agent%3A%20selenium%2F3.13.0%20(python%20windows)%0D%0AConnection%3A%20close%0D%0AHost%3A%20127.0.0.1%3A52898%0D%0A%0D%0A%0D%0A">webmaster</a>.</p>\n\n<br>\n</div>\n\n<hr>\n<div id="footer">\n<p>Generated Fri, 27 Jul 2018 04:19:20 GMT by vps188962 (squid/3.5.23)</p>\n<!-- ERR_CONNECT_FAIL -->\n</div>\n</body></html>\n'
Answer 0 (score: 0)
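My guess is that this is a problem with the proxies themselves. Free proxies are generally unreliable (in my experience, very often), and you have to be prepared for them to produce any error you can realistically imagine: connection errors, timeouts, even corrupted responses. The second line of the log looks like a generic response from the Squid proxy software, which indicates a proxy error in this case. The error page itself shows the proxy being asked to fetch http://127.0.0.1:52898/session and reporting "(111) Connection refused".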
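One way to act on this is to test each proxy before handing it to the browser and drop the ones that fail. The following is only a minimal sketch under some assumptions: it uses the standard library's `urllib.request` rather than `requests`, the function names `proxy_works` and `working_proxies` are made up for illustration, and the plain-HTTP test URL `http://httpbin.org/ip` is assumed to return JSON. The optional `opener` parameter exists only so the check can be exercised without a network connection.

```python
import http.client
import json
import urllib.request


def proxy_works(proxy, test_url="http://httpbin.org/ip", timeout=5.0, opener=None):
    """Return True if `proxy` ("host:port") relays a request and returns valid JSON.

    `test_url` and `timeout` are illustrative defaults, not recommended values.
    """
    if opener is None:
        # Route both http and https traffic through the candidate proxy.
        handler = urllib.request.ProxyHandler(
            {"http": "http://" + proxy, "https": "http://" + proxy}
        )
        opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            # An HTML error page served with status 200 fails to parse as JSON.
            json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, HTTPError, timeouts and refused connections;
        # ValueError covers invalid JSON and undecodable bytes.
        return False
    return True


def working_proxies(candidates, check=proxy_works):
    """Keep only the candidate proxies that pass `check`, preserving order."""
    return [proxy for proxy in candidates if check(proxy)]
```

In the spider above, something like `proxy_pool = cycle(working_proxies(proxies))` could then replace the unfiltered pool. A proxy that passes this check can still fail a moment later, so the error handling around the browser call remains necessary.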