Question

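How can I avoid being blocked by Google when querying its search engine with requests? I loop over a list of dates to get the results of a query such as Microsoft Release for each month in the list.

I am currently rotating the user agent and adding a 10-second time.sleep between requests, but I always get blocked. How can I use proxies in combination with my approach? Is there a better way?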

import random

import requests
from bs4 import BeautifulSoup

http_proxy  = "http://10.10.1.10:3128"
https_proxy = "https://10.10.1.11:1080"
ftp_proxy   = "ftp://10.10.1.10:3128"

# requests picks the proxy whose key matches the scheme of the requested URL
proxyDict = {
    "http": http_proxy,
    "https": https_proxy,
    "ftp": ftp_proxy,
}

# user_agents, startDate and endDate are set elsewhere, inside the loop over the date list
page_response = requests.get(
    'https://www.google.com/search?q=Microsoft+Release&rlz=1C1GCEA_enGB779&tbs=cdr:1,cd_min:' + startDate + ',cd_max:' + endDate + '&source=inms&tbm=nws&num=150',
    timeout=60,
    verify=False,
    headers={'User-Agent': random.choice(user_agents)},
    proxies=proxyDict,
)
soup = BeautifulSoup(page_response.content, 'html.parser')

Then I get the following error:

ConnectTimeout: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=Microsoft+Release&rlz=1C1GCEA_enGB779&tbs=cdr:1,cd_min:'+startDate+',cd_max:'+endDate+'&source=inms&tbm=nws&num=150 (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x1811499358>, 'Connection to 10.10.1.11 timed out. (connect timeout=60)'))

Any ideas on how to fix this error and get it working?

Answer 1

One way is to do something like this:

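# https://stackoverflow.com/a/13395324/15164646
# 'HTTP_PROXY' and 'HTTPS_PROXY' are placeholders for real proxy URLs.
# The 'https' entry is needed because the Google URL below uses https.
proxies = {
  'http': 'HTTP_PROXY',
  'https': 'HTTPS_PROXY'
}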

Which becomes (there is an example in the online IDE of how to scrape Google Scholar with a proxy):

# headers, startDate and endDate are assumed to be defined as in the question
requests.get('https://www.google.com/search?q=Microsoft+Release&rlz=1C1GCEA_enGB779&tbs=cdr:1,cd_min:'+startDate+',cd_max:'+endDate+'&source=inms&tbm=nws&num=150', proxies=proxies, headers=headers).text
...
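
Note that the ConnectTimeout in the question reports that the connection to 10.10.1.11 timed out, which suggests the proxy itself was unreachable rather than Google rejecting the request. Below is a minimal sketch of how the monthly loop could be put together with a proxy, not a definitive implementation. The proxy address, the user agent string, the helper names (month_range, build_params, fetch) and the M/D/YYYY date format for cd_min/cd_max are all assumptions for illustration. Replace the proxy with one you can actually reach.

```python
import calendar
import random

import requests

# Assumption: placeholder proxy address, replace with a proxy that is reachable from your machine.
PROXIES = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:3128",
}
# Assumption: example user agent string, extend this list with your own.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
]


def month_range(year, month):
    """Return the first and last day of a month as M/D/YYYY strings (assumed format)."""
    last_day = calendar.monthrange(year, month)[1]
    return f"{month}/1/{year}", f"{month}/{last_day}/{year}"


def build_params(query, start_date, end_date):
    """Build the query parameters that the question concatenates into the URL by hand."""
    return {
        "q": query,
        "tbs": f"cdr:1,cd_min:{start_date},cd_max:{end_date}",
        "tbm": "nws",
        "num": 150,
    }


def fetch(query, year, month, proxies=PROXIES, timeout=(10, 60)):
    """Request one month of results through the proxy; return the HTML or None on failure."""
    start_date, end_date = month_range(year, month)
    try:
        response = requests.get(
            "https://www.google.com/search",
            params=build_params(query, start_date, end_date),
            headers={"User-Agent": random.choice(USER_AGENTS)},
            proxies=proxies,
            timeout=timeout,  # (connect timeout, read timeout) in seconds
        )
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as exc:
        # Covers proxy errors, timeouts and HTTP error statuses such as 429.
        print(f"{year}-{month:02d} failed: {exc}")
        return None
```

You would then call fetch("Microsoft Release", 2019, 1) for each month in your list, keeping the time.sleep between calls. Passing params lets requests handle the URL encoding, and the short connect timeout makes a dead proxy fail quickly instead of waiting the full 60 seconds.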

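Another solution is to use the Google Search Engine Results API from SerpApi. It is a paid API with a free trial of 5,000 searches.

The main difference in this particular case is that you don't have to maintain the parser or figure out how to avoid being blocked by Google. That is already done for the end user. Check out the playground.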
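
As a rough sketch of what such a call might look like with plain requests: the endpoint and the parameter names (engine, q, tbm, tbs, api_key) are written from memory of SerpApi's public interface and should be checked against their documentation. The helper names and the date format are assumptions for illustration.

```python
import requests

# Assumption: SerpApi search endpoint, verify against the official documentation.
SERPAPI_ENDPOINT = "https://serpapi.com/search"


def build_serpapi_params(query, start_date, end_date, api_key):
    """Build the parameters for a Google News search restricted to a date range."""
    return {
        "engine": "google",
        "q": query,
        "tbm": "nws",
        "tbs": f"cdr:1,cd_min:{start_date},cd_max:{end_date}",
        "api_key": api_key,  # your own SerpApi key
    }


def search_news(query, start_date, end_date, api_key):
    """Send the request and return the parsed JSON response as a dict."""
    response = requests.get(
        SERPAPI_ENDPOINT,
        params=build_serpapi_params(query, start_date, end_date, api_key),
        timeout=60,
    )
    response.raise_for_status()
    return response.json()
```

No proxy or user agent rotation is configured here because, as described above, that part is handled on the API side.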

Disclaimer: I work for SerpApi.
