Google search returns None 302 on AppEngine

Time: 2013-12-22 04:48:43

Tags: python api google-app-engine http search-engine

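I am querying the Google search engine. Run locally, the code works and returns the expected results. When the same code is deployed on AppEngine, the request returns None 302.

The following program returns the links found in the Google search results.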

# The first two imports will be slightly different when deployed on appengine
from pyquery import PyQuery as pq
import requests
import random
try:
    from urllib.parse import quote as url_quote
except ImportError:
    from urllib import quote as url_quote

USER_AGENTS = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100 101 Firefox/22.0',
               'Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',
               'Mozilla/5.0 (Windows; Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',)


SEARCH_URL = 'https://www.google.com/search?q=site:foobar.com%20{0}'

def get_result(url):
    return requests.get(url, headers={'User-Agent': random.choice(USER_AGENTS)}).text


def get_links(query):
    result = get_result(SEARCH_URL.format(url_quote(query)))
    html = pq(result)
    return [a.attrib['href'] for a in html('.l')] or \
        [a.attrib['href'] for a in html('.r')('a')]

# Parenthesized form works on both Python 2 and Python 3
print(get_links('foo bar'))

The code deployed on AppEngine:

import sys
sys.path[0:0] = ['distlibs']

import lxml
import webapp2
import json
from requests import api
from pyquery.pyquery import PyQuery as pq
import random

try:
    from urllib.parse import quote as url_quote
except ImportError:
    from urllib import quote as url_quote


USER_AGENTS = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100 101 Firefox/22.0',
               'Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',
               'Mozilla/5.0 (Windows; Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',)


SEARCH_URL = 'https://www.google.com/search?q=site:foobar.com%20{0}'



def get_result(url):
    return api.get(url, headers={'User-Agent': random.choice(USER_AGENTS)}).text


def get_links(query):
    result = get_result(SEARCH_URL.format(url_quote(query)))
    html = pq(result)
    return [a.attrib['href'] for a in html('.l')] or \
        [a.attrib['href'] for a in html('.r')('a')]


form="""
<form action="/process">
    <input name="q">
    <input type="submit">
</form>
"""


class MainHandler(webapp2.RequestHandler):
    def get(self):
        self.response.out.write("<h3>Write something.</h3><br>")
        self.response.out.write(form)


class ProcessHandler(webapp2.RequestHandler):
    def get(self):
        query = self.request.get("q")
        self.response.out.write("Your query : " + query)
        results = get_links(query)
        self.response.out.write(results[0])



app = webapp2.WSGIApplication([('/', MainHandler),
                               ('/process', ProcessHandler)],
                               debug=True)

I have tried querying over both http and https. Below is the AppEngine log for the request.

Starting new HTTP connection (1): www.google.com
D 2013-12-21 13:13:37.217
"GET /search?q=site:foobar.com%20foo%20bar HTTP/1.1" 302 None
I 2013-12-21 13:13:37.218
Starting new HTTP connection (1): ipv4.google.com
D 2013-12-21 13:13:37.508
"GET /sorry/IndexRedirect?continue=http://www.google.com/search%3Fq%3Dsite:foobar.com%20foo%20bar HTTP/1.1" 403 None
E 2013-12-21 20:51:32.090
list index out of range
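
The final log entry, list index out of range, most likely comes from results[0] in ProcessHandler: when Google answers with the 403 page, neither selector matches and get_links returns an empty list. A minimal sketch of a guard follows. The helper name first_link_or_message is hypothetical (it is not part of the code above), and it assumes the caller has access to the final HTTP status code, which get_result as written does not return.

```python
def first_link_or_message(status_code, links):
    """Return the first link, or a readable message when there is none.

    status_code: HTTP status of the final response (for example 200 or 403).
    links: list of hrefs extracted from the page; may be empty.
    Hypothetical helper for illustration only.
    """
    if status_code != 200:
        # A non-200 status means the page is not a normal results page.
        return "Search request failed with HTTP status %d" % status_code
    if not links:
        # Avoids the IndexError raised by links[0] on an empty list.
        return "No links found in the response"
    return links[0]
```

With a guard like this, the handler would write an explanatory message instead of raising when the search request is blocked.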

1 answer:

Answer 0 (score: 0)

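I'm puzzled why you're trying to spoof the User-Agent header, but if it makes you happy, go for it. Note that if requests.get is using urlfetch under the hood, App Engine appends a string to the User-Agent header your app supplies, identifying your app. (See https://developers.google.com/appengine/docs/python/urlfetch/#Python_Request_headers.)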
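
To illustrate why a spoofed User-Agent may not hide where the request comes from, here is a rough sketch of the header the remote server would see. The suffix format and the space separator are assumptions based on my recollection of the URL Fetch documentation linked above, and effective_user_agent and 'my-app-id' are hypothetical names; check the docs for the exact wording.

```python
def effective_user_agent(supplied_user_agent, app_id):
    """Approximate the User-Agent sent after App Engine appends its identifier.

    The suffix format below is an assumption taken from the URL Fetch docs
    as recalled; it is not produced by calling any App Engine API.
    """
    suffix = 'AppEngine-Google; (+http://code.google.com/appengine; appid: %s)' % app_id
    return '%s %s' % (supplied_user_agent, suffix)


# 'my-app-id' is a placeholder application ID.
print(effective_user_agent('Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0',
                           'my-app-id'))
```

Whatever browser string the app picks, the appended part still marks the request as coming from App Engine.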

Try passing follow_redirects=False to urlfetch. That's how you make requests to other App Engine apps. For reasons that aren't at all obvious, it might help in this case.
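
A minimal sketch of that suggestion, calling urlfetch directly instead of going through requests. The function name fetch_search_page and the injectable fetch parameter are hypothetical, added so the sketch can be exercised outside App Engine; inside the App Engine Python runtime it falls back to google.appengine.api.urlfetch.fetch. Whether this avoids the 403 is not something I have verified.

```python
def fetch_search_page(url, user_agent, fetch=None):
    """Fetch url once without following redirects.

    Returns (status_code, location_header, body). The fetch argument is a
    hypothetical hook for testing; by default urlfetch.fetch is used.
    """
    if fetch is None:
        # Only importable inside the App Engine Python runtime.
        from google.appengine.api import urlfetch
        fetch = urlfetch.fetch
    response = fetch(url,
                     headers={'User-Agent': user_agent},
                     follow_redirects=False)
    # On a 302 the redirect target is in the Location header, if present.
    location = response.headers.get('Location')
    return response.status_code, location, response.content
```

Inspecting the status code and Location header of the first response shows whether Google is redirecting to the /sorry/ page before any parsing is attempted.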