Question

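I'm generally unfamiliar with Python (Selenium, Scrapy, etc.) and web crawling, but I'm very comfortable with other languages such as Java, so please forgive me if I'm missing something very simple!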

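My end goal is to visit a page, stay on it for about 10 seconds, then close the browser and repeat. However, I'm trying to rotate my IP address by sending each request through a different proxy. I've been able to visit the page, but when I add the rotating proxies to the mix, I get a long connection error that I can't figure out and that seems to consist largely of CSS.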

Full code snippet

The problem seems to be caused by the second line of the try block, where the driver tries to access the website.

import time

import scrapy
import requests

from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy, ProxyType
from scrapy.http import Request
from lxml.html import fromstring
from itertools import cycle


class VisitPageSpider(scrapy.Spider):
    name = 'visitpage'
    allowed_domains = ['books.toscrape.com']

    def start_requests(self):

        test_url = 'http://books.toscrape.com'

        proxies = self.get_proxies()
        proxy_pool = cycle(proxies)

        prox = Proxy()
        prox.proxy_type = ProxyType.MANUAL

        view_count = 0

        url = 'https://httpbin.org/ip'
        for i in range(1, 11):

            proxy = next(proxy_pool)
            prox.http_proxy = proxy
            prox.socks_proxy = proxy
            prox.ssl_proxy = proxy

            capabilities = webdriver.DesiredCapabilities.INTERNETEXPLORER

            prox.add_to_capabilities(capabilities)

            print("Request #%d" % i)

            try:
                self.driver = webdriver.Ie(desired_capabilities=capabilities)
                self.driver.get(test_url)
                view_count += 1

                time.sleep(10)
                self.driver.quit()
            except Exception as exc:
                # Report the underlying error instead of silently swallowing it.
                print("Skipping. Connection error: %s" % exc)

        print('Total New Views %d' % view_count)
        yield Request(test_url, callback=self.visit_page)

    def visit_page(self, response):
        pass

    def get_proxies(self):

        url = 'https://free-proxy-list.net/'
        response = requests.get(url)
        parser = fromstring(response.text)
        proxies = set()
        for i in parser.xpath('//tbody/tr')[:10]:
            if i.xpath('.//td[7][contains(text(),"yes")]'):
                proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
                proxies.add(proxy)
                print(proxies)
        return proxies

CMD output

For the first two lines of the try block, respectively:

2018-07-26 18:19:21 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:52898/session {"capabilities": {"firstMatch": [{}], "alwaysMatch": {"browserName": "internet explorer", "platformName": "windows", "proxy": {"proxyType": "manual", "httpProxy": "46.227.162.167:8080", "sslProxy": "46.227.162.167:8080", "socksProxy": "46.227.162.167:8080"}}}, "desiredCapabilities": {"browserName": "internet explorer", "version": "", "platform": "WINDOWS", "proxy": {"proxyType": "MANUAL", "httpProxy": "46.227.162.167:8080", "sslProxy": "46.227.162.167:8080", "socksProxy": "46.227.162.167:8080"}}}

2018-07-26 18:19:21 [selenium.webdriver.remote.remote_connection] DEBUG: b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html><head>\n<meta type="copyright" content="Copyright (C) 1996-2015 The Squid Software Foundation and contributors">\n<meta http-equiv="Content-Type" CONTENT="text/html; charset=utf-8">\n<title>ERROR: The requested URL could not be retrieved</title>\n<style type="text/css"><!-- \n /*\n * Copyright (C) 1996-2016 The Squid Software Foundation and contributors\n *\n * Squid software is distributed under GPLv2+ license and includes\n * contributions from numerous individuals and organizations.\n * Please see the COPYING and CONTRIBUTORS files for details.\n */\n\n/*\n Stylesheet for Squid Error pages\n Adapted from design by Free CSS Templates\n http://www.freecsstemplates.org\n Released for free under a Creative Commons Attribution 2.5 License\n*/\n\n/* Page basics */\n* {\n\tfont-family: verdana, sans-serif;\n}\n\nhtml body {\n\tmargin: 0;\n\tpadding: 0;\n\tbackground: #efefef;\n\tfont-size: 12px;\n\tcolor: #1e1e1e;\n}\n\n/* Page displayed title area */\n#titles {\n\tmargin-left: 15px;\n\tpadding: 10px;\n\tpadding-left: 100px;\n\tbackground: url(\'/squid-internal-static/icons/SN.png\') no-repeat left;\n}\n\n/* initial title */\n#titles h1 {\n\tcolor: #000000;\n}\n#titles h2 {\n\tcolor: #000000;\n}\n\n/* special event: FTP success page titles */\n#titles ftpsuccess {\n\tbackground-color:#00ff00;\n\twidth:100%;\n}\n\n/* Page displayed body content area */\n#content {\n\tpadding: 10px;\n\tbackground: #ffffff;\n}\n\n/* General text */\np {\n}\n\n/* error brief description */\n#error p {\n}\n\n/* some data which may have caused the problem */\n#data {\n}\n\n/* the error message received from the system or other software */\n#sysmsg {\n}\n\npre {\n    font-family:sans-serif;\n}\n\n/* special event: FTP / Gopher directory listing */\n#dirmsg {\n    font-family: courier;\n    color: black;\n    font-size: 
10pt;\n}\n#dirlisting {\n    margin-left: 2%;\n    margin-right: 2%;\n}\n#dirlisting tr.entry td.icon,td.filename,td.size,td.date {\n    border-bottom: groove;\n}\n#dirlisting td.size {\n    width: 50px;\n    text-align: right;\n    padding-right: 5px;\n}\n\n/* horizontal lines */\nhr {\n\tmargin: 0;\n}\n\n/* page displayed footer area */\n#footer {\n\tfont-size: 9px;\n\tpadding-left: 10px;\n}\n\n\nbody\n:lang(fa) { direction: rtl; font-size: 100%; font-family: Tahoma, Roya, sans-serif; float: right; }\n:lang(he) { direction: rtl; }\n --></style>\n</head><body id=ERR_CONNECT_FAIL>\n<div id="titles">\n<h1>ERROR</h1>\n<h2>The requested URL could not be retrieved</h2>\n</div>\n<hr>\n\n<div id="content">\n<p>The following error was encountered while trying to retrieve the URL: <a href="http://127.0.0.1:52898/session">http://127.0.0.1:52898/session</a></p>\n\n<blockquote id="error">\n<p><b>Connection to 127.0.0.1 failed.</b></p>\n</blockquote>\n\n<p id="sysmsg">The system returned: <i>(111) Connection refused</i></p>\n\n<p>The remote host or network may be down. 
Please try the request again.</p>\n\n<p>Your cache administrator is <a href="mailto:webmaster?subject=CacheErrorInfo%20-%20ERR_CONNECT_FAIL&amp;body=CacheHost%3A%20vps188962%0D%0AErrPage%3A%20ERR_CONNECT_FAIL%0D%0AErr%3A%20(111)%20Connection%20refused%0D%0ATimeStamp%3A%20Fri,%2027%20Jul%202018%2004%3A19%3A20%20GMT%0D%0A%0D%0AClientIP%3A%2072.234.175.171%0D%0AServerIP%3A%20127.0.0.1%0D%0A%0D%0AHTTP%20Request%3A%0D%0APOST%20%2Fsession%20HTTP%2F1.1%0AAccept-Encoding%3A%20identity%0D%0AContent-Length%3A%20501%0D%0AAccept%3A%20application%2Fjson%0D%0AContent-Type%3A%20application%2Fjson%3Bcharset%3DUTF-8%0D%0AUser-Agent%3A%20selenium%2F3.13.0%20(python%20windows)%0D%0AConnection%3A%20close%0D%0AHost%3A%20127.0.0.1%3A52898%0D%0A%0D%0A%0D%0A">webmaster</a>.</p>\n\n<br>\n</div>\n\n<hr>\n<div id="footer">\n<p>Generated Fri, 27 Jul 2018 04:19:20 GMT by vps188962 (squid/3.5.23)</p>\n<!-- ERR_CONNECT_FAIL -->\n</div>\n</body></html>\n'

Answer 1

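My guess is that this is a problem with the proxies. Free proxies are usually unreliable (in my experience, very often), and you have to be prepared for them to produce any realistically possible error: connection errors, timeouts, even corrupted responses. The second line of the log appears to be a generic response from the Squid proxy software, indicating a proxy error in this case.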
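
One way to act on this is to probe each proxy with a cheap request before handing it to the browser, and to drop any that fail. The following is only a minimal sketch using the standard library: the helper names `proxy_works` and `filter_working`, the test URL, and the 5-second timeout are assumptions for illustration and are not part of the code in the question.

```python
import http.client
import urllib.request


def proxy_works(proxy, test_url="http://httpbin.org/ip", timeout=5.0):
    """Return True if a plain HTTP GET through `proxy` ("host:port") succeeds.

    Hypothetical helper: any connection error, timeout, HTTP error status,
    or malformed response is treated as "this proxy is unusable".
    """
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException, ValueError):
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # HTTPException covers broken or truncated responses.
        return False


def filter_working(proxies, check=proxy_works):
    """Keep only the proxies for which `check` returns True, in order."""
    return [proxy for proxy in proxies if check(proxy)]
```

In the spider from the question, something like `cycle(filter_working(self.get_proxies()))` would then rotate only over proxies that answered the probe. A proxy can still fail later, so the try/except around the browser call remains necessary.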

Unable to connect using Selenium driver.get()
