Question

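I wrote code to parse Google results through proxies. I'm using Python 3, but I get errors: 503 Service Unavailable, 403 Forbidden, or no connection at all.

What am I doing wrong?

My code: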

import time
import urllib.error
import urllib.request

header = "Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0) Gecko/25250101"
candidate_proxies = ['http://54.183.219.170:80']

def fetch(url):  # enclosing function name is assumed; the original snippet started inside it
    for proxy in candidate_proxies:
        print("Trying HTTP proxy %s" % proxy)
        try:
            # With this mapping only http:// URLs are routed through the proxy
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            request = urllib.request.Request(url)
            request.add_header("User-Agent", header)
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
            response = urllib.request.urlopen(request)
            html = response.read()
            print("Got URL using proxy %s" % proxy)
            return html
        except urllib.error.HTTPError as e:
            print("Error accessing:", url)
            # e.read() returns bytes in Python 3, so compare against a bytes literal
            if e.code == 503 and b'CaptchaRedirect' in e.read():
                print("Google is requiring a Captcha. For more information see: 'https://support.google.com/websearch/answer/86640'")
            print("Trying next proxy in 5 seconds")
            time.sleep(5)
        except Exception as e:
            print("Error accessing:", url)
            print(e)
            return None
    return None

Question:

Why does Google detect my proxy, and how do I do this correctly?

Answer 1

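You should consider using a service like Proxicity.io (https://www.proxicity.io). You can search for proxies that support Google and get a fresh, verified proxy with every API request. The service can also be used for free! One of its features is that proxies are checked against common hosts (Google, Amazon, Craigslist, etc.), and you can query for them using the supportedWebsites field.

You can easily get a fresh, verified proxy with the following: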

 import requests

 response = requests.get('https://api.proxicity.io/v2/<OPTIONAL-API-KEY>/proxy')
 proxy = response.json()['curl']  # curl returns protocol://ip:port format
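
Once you have a proxy string in protocol://ip:port form, it still has to be registered for the scheme you actually request. In the question's code only 'http' is mapped, so an https:// URL would not go through the proxy at all. A minimal sketch with urllib, assuming the proxy accepts HTTPS traffic; the helper name opener_for and the address are placeholders, not part of any service's API:

```python
import urllib.request

def opener_for(proxy_url):
    """Build an opener that sends both http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

# Placeholder address; substitute the proxy string you obtained.
opener = opener_for("http://203.0.113.10:8080")

# Uncomment to perform a real request through the proxy:
# html = opener.open("https://www.google.com/search?q=test", timeout=10).read()
```

Using opener.open() directly, rather than install_opener(), keeps the proxy choice local to this opener instead of changing it for every later urlopen() call.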

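Full disclosure: I am the lead developer of this project. I built it for other developers to use. I don't want to get in trouble with StackOverflow for sharing this service.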

Answer 2

You need dedicated proxies for this. You can easily find a reliable provider.

Then cycle through the proxies.

from itertools import cycle
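
The answer's snippet stops at the import, so here is one way the rotation could look. This is only a sketch under assumptions: the function names, the max_attempts limit, and the idea of passing the fetch function in as a parameter are all mine, not the answerer's.

```python
from itertools import cycle
import urllib.request

def fetch_via_proxy(url, proxy, timeout=10):
    """Fetch url through a single proxy (hypothetical helper, performs a real request)."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    return opener.open(url, timeout=timeout).read()

def fetch_with_rotation(url, proxies, fetch=fetch_via_proxy, max_attempts=6):
    """Try proxies round-robin until one succeeds or max_attempts is reached."""
    if not proxies:
        return None
    rotation = cycle(proxies)  # endless iterator: p1, p2, ..., p1, p2, ...
    for _ in range(max_attempts):
        proxy = next(rotation)
        try:
            return fetch(url, proxy)
        except OSError as exc:  # urllib's URLError and HTTPError are OSError subclasses
            print("Proxy %s failed: %s" % (proxy, exc))
    return None
```

You would call it as fetch_with_rotation(url, your_proxy_list). Capping the attempts matters because cycle() never ends on its own, so without a limit a list of dead proxies would loop forever.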

Parsing Google with proxies in Python