Question

I am building a script to scrape Google search results. This is what I have so far:

    import urllib
    keyword = "google"
    print urllib.urlopen("https://www.google.co.in/search?q=" + keyword).read()

But it gives me the following response:

<!DOCTYPE html><html lang=en><meta charset=utf-8><meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width"><title>Error 403 (Forbidden)!!1</title><style>*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/errors/logo_sm_2.png) no-repeat}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/errors/logo_sm_2_hr.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/errors/logo_sm_2_hr.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/errors/logo_sm_2_hr.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:55px;width:150px}</style><a href=//www.google.com/></a>403. <ins>That’s an error.</ins>Your client does not have permission to get URL <code>/search?q=google</code> from this server. (Client IP address: 117.196.168.89) Please see Google's Terms of Service posted at http://www.google.com/terms_of_service.html If you believe that you have received this response in error, please <A HREF="http://www.google.com/support/bin/request.py?contact_type=user&hl=en">report</A> your problem. However, please make sure to take a look at our Terms of Service (http://www.google.com/terms_of_service.html). In your email, please send us the entire code displayed below. Please also send us any information you may know about how you are performing your Google searches-- for example, "I'm using the Opera browser on Linux to do searches from home. 
My Internet access is through a dial-up account I have with the FooCorp ISP." or "I'm using the Konqueror browser on Linux to search from my job at myFoo.com. My machine's IP address is 10.20.30.40, but all of myFoo's web traffic goes through some kind of proxy server whose IP address is 10.11.12.13." (If you don't know any information like this, that's OK. But this kind of information can help us track down problems, so please tell us what you can.)We will use all this information to diagnose the problem, and we'll hopefully have you back up and searching with Google again quickly! Please note that although we read all the email we receive, we are not always able to send a personal response to each and every email. So don't despair if you don't hear back from us! Also note that if you do not send us the entire code below, we will not be able to help you.Best wishes, The Google Team<BLOCKQUOTE>/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/ aef0-l8vNw3cWys_OWGKrv6VYDewUx0bhWxSeo2Mk4vGTZSoh MdeNZki3vp-kzRGjrBTseg6uGBypibuTNGSeJoPRkDPCOFkyA YBVgssaJaqSibV7khohBnsUVRVZqALwIe2lD6pdddMQIZ-Zg2 WEE-rO-ZackE5L2gwlmHZHP2oWML3ZlGgUL6CAbMbFmzVda38 ZYYVZLKBcjY1gSLk-FSzBc7QQnp0vrhkY6LnrALX94oK7Yrml bKX-5KmpyhsI7aW3da5Rt5nt0K9PVPbKvpZ1LN-hdRqg749K6 T4v8mGfuH6BHSQUAPW1Byx_Wy1TGsyhZJQ02jrz7K0RBg4r0i 9O6Rs7-FFRzESkiyzRQaExUdpBpl3Mmguh1JXR_yxDJre9R7u 3AWKfCkt8BxKuv37oAIslM2Caor4QBXSNrq1F7zUetx8HxmaW pX_6KsXyjs3-Pfq5NKOuzNCjatrhXdKC74NmNHztTPJU-4MzV kUPuUehnDYgcgGAVYLLGiWvG4Scm8G2Gq2UnacMQsZ5BB7rgY DXJnZwbMbVX53-llhCMeQfBTteOWIfWQR2FOyc-tuaRHX6c3N rzpNDX9ZufFfOXRNkaORCZxkSEoX1xDBq0VGdkkCfwlUdG9Jq prYBPnpRyhjxjC3c4n68AuEYHtMTVmbK-fyMtcWLMTVXzIrYS EjACpMTnHRavhYza4ZJgs4SViS4FrsmJ0P3CdyLLayR0xMFM6 m7rxy-zaABo7iof_re5PKcFP6EYqD0Wm-ZlLksUh2a1LVaAsq sSqnPPqq5qCu0z8wQe5jeGCRCY2vrT5HWmYNJbhyCyN_HiHGR bHDb8f3_OcgAHsT7zv1a4FOG4B0JztqskzYmssBb-ezvErkp6 uZtwiKJc30F30RpQhKEb_rPjhpwc5dr3MUsTuki2j2tBSQl_O kjFef_Jvl3u8TPQY5c6dqUSQv--p0N95Jv-WehS32lvyUbeEB 
mN7ZC8oCFj06BRn5NaU9P8p1d7fmYyxyta2dZ21UfaRMhX8TZ VgKiSDVyMO2GZ09bUEFGW4KvvTJDyQT_UMkCsahrv2MP_yI-D fwEArSXvPIpyESHeyPXfFN-Z9_OuVwGDU2riHFIWgw5IPwtER e0Ukzrn2iwGHHL8j2JdSNbunrifS-RqkK2hgQl16-TfqN11NL Lgwtt-Kp3XL86K61Qq7lU-NxB8BOO_i-QOQszn6uRmb3VR__Q T_0E9FULbsR9kgTyXDKQmOQ-3qeaFlz4in9V9PJ +/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+ </BLOCKQUOTE>

Does Google allow its pages to be scraped?

Answer 1

Actually, Google does not, in the sense that it blocks bots. However, you can use mechanize to impersonate a browser and get the results.


    import mechanize

    keyword = "google"  # search term; defined here so the snippet runs on its own

    # Create a browser object that ignores robots.txt and sends a desktop Chrome User-Agent
    chrome = mechanize.Browser()
    chrome.set_handle_robots(False)
    chrome.addheaders = [('User-agent',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36')]
    base_url = 'https://www.google.co.in/search?q='
    search_url = base_url + keyword.replace(' ', '+')
    htmltext = chrome.open(search_url).read()

Try this. I hope it helps.
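
The `keyword.replace(' ', '+')` step in the snippet above only handles spaces. As a minimal sketch, assuming Python 3, the standard library's `urllib.parse.urlencode` can escape the whole query instead. `build_search_url` and `BASE_URL` are hypothetical names used for illustration, not part of mechanize:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.co.in/search"  # same host as in the snippet above


def build_search_url(keyword):
    """Return the search URL with the keyword URL-encoded."""
    # urlencode turns spaces into '+' and percent-escapes characters such as '&' or '+'
    return BASE_URL + "?" + urlencode({"q": keyword})


if __name__ == "__main__":
    print(build_search_url("pizza is awesome"))
    # prints https://www.google.co.in/search?q=pizza+is+awesome
```

The resulting string could then be passed to `chrome.open(...)` in place of `search_url`.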

Answer 2

You can also fake the headers with urllib to get the results.

Something like this:

import urllib2

keyword = "google"
url = "https://www.google.co.in/search?q=" + keyword

# Build an opener
opener = urllib2.build_opener()

# If you are behind a proxy, build the opener with a ProxyHandler instead
#opener = urllib2.build_opener(urllib2.ProxyHandler(proxies={"http": "http://proxy.corp.ads:8080"}))

# Fake a browser by sending a browser-like User-Agent header
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open(url).read()
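
The snippet above relies on `urllib2` and the `print` statement, which exist only in Python 2. A rough Python 3 equivalent might look like the sketch below, assuming `urllib.request` as the replacement module. `build_request` and `fetch` are hypothetical helper names, and Google may still reject the request even with a browser-like header:

```python
import urllib.request
from urllib.parse import urlencode


def build_request(keyword, user_agent="Mozilla/5.0"):
    """Build (but do not send) a search request with a browser-like User-Agent."""
    url = "https://www.google.co.in/search?" + urlencode({"q": keyword})
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(request):
    """Send the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage (not executed here because it contacts Google):
#     html = fetch(build_request("google"))
```

Separating the request construction from the network call makes it possible to inspect the URL and headers before anything is sent.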

Answer 3

Google treats your script differently depending on its user-agent (if you are using requests, the default user-agent is python-requests).

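You just need to specify a browser user-agent (Chrome, Mozilla, Edge, IE, Safari...) so that Google treats the request as coming from a "user", that is, so that the script impersonates a real browser visit.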
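
To see why a script is easy to recognize, it can help to look at the identifier an HTTP client sends by default. The sketch below uses Python 3's standard `urllib.request` rather than requests (an assumption made so that it runs without extra packages), and the function names are hypothetical:

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent that urllib sends when none is specified."""
    opener = urllib.request.build_opener()
    # A fresh opener starts with one default header: ('User-agent', 'Python-urllib/<version>')
    return dict(opener.addheaders)["User-agent"]


def browser_like_opener(user_agent):
    """Return an opener whose default User-Agent is replaced by the given string."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-agent", user_agent)]
    return opener


if __name__ == "__main__":
    # The exact version suffix depends on the interpreter in use
    print(default_user_agent())
```

A default value starting with `Python-urllib/` plainly identifies the client as a script, which is why overriding it with a browser string changes how the server responds.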

If you are using the requests library, you can specify it like this (lists of user-agent strings are available on other websites):

import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get(
  'https://www.google.com/search?q=pizza is awesome', headers=headers).text

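I have answered a separate question, with example code, on how to scrape the titles, snippets, and links from Google search results.

Alternatively, you can use a third-party Google Search Engine Results API, or the Google Organic Results API from SerpApi. It is a paid API with a free trial.

Check out the Playground to test it and see the output.

Code to get the raw HTML response: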

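import os
import urllib.request  # "import urllib" alone does not load the request submodule
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "london",
    "api_key": os.getenv("API_KEY"),  # API key read from the environment
}

search = GoogleSearch(params)
results = search.get_dict()

# URL of the stored raw HTML file for this search
html = results['search_metadata']['raw_html_file']
print(urllib.request.urlopen(html).read())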

Disclaimer: I work for SerpApi.

Unable to get Google search results in Python

3 answers: