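I am querying the Google search engine, and the code works fine locally, returning the expected results. When the same code is deployed on App Engine, the request gets a 302 response (the log shows "302 None").
The following program returns the links found in the Google search results.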
# The first two imports will be slightly different when deployed on appengine
from pyquery import PyQuery as pq
import requests
import random
try:
    from urllib.parse import quote as url_quote
except ImportError:
    from urllib import quote as url_quote
USER_AGENTS = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100 101 Firefox/22.0',
               'Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',
               'Mozilla/5.0 (Windows; Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',)
SEARCH_URL = 'https://www.google.com/search?q=site:foobar.com%20{0}'
def get_result(url):
    # Fetch the page with a randomly chosen User-Agent header
    return requests.get(url, headers={'User-Agent': random.choice(USER_AGENTS)}).text

def get_links(query):
    result = get_result(SEARCH_URL.format(url_quote(query)))
    html = pq(result)
    # Try the '.l' links first, then fall back to anchors inside '.r'
    return [a.attrib['href'] for a in html('.l')] or \
        [a.attrib['href'] for a in html('.r')('a')]

print(get_links('foo bar'))
The code deployed on App Engine:
import sys
sys.path[0:0] = ['distlibs']
import lxml
import webapp2
import json
from requests import api
from pyquery.pyquery import PyQuery as pq
import random
try:
    from urllib.parse import quote as url_quote
except ImportError:
    from urllib import quote as url_quote
USER_AGENTS = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100 101 Firefox/22.0',
               'Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',
               'Mozilla/5.0 (Windows; Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',)
SEARCH_URL = 'https://www.google.com/search?q=site:foobar.com%20{0}'
def get_result(url):
    # Fetch the page with a randomly chosen User-Agent header
    return api.get(url, headers={'User-Agent': random.choice(USER_AGENTS)}).text

def get_links(query):
    result = get_result(SEARCH_URL.format(url_quote(query)))
    html = pq(result)
    # Try the '.l' links first, then fall back to anchors inside '.r'
    return [a.attrib['href'] for a in html('.l')] or \
        [a.attrib['href'] for a in html('.r')('a')]

form="""
<form action="/process">
<input name="q">
<input type="submit">
</form>
"""
class MainHandler(webapp2.RequestHandler):
    def get(self):
        self.response.out.write("<h3>Write something.</h3><br>")
        self.response.out.write(form)

class ProcessHandler(webapp2.RequestHandler):
    def get(self):
        query = self.request.get("q")
        self.response.out.write("Your query : " + query)
        results = get_links(query)
        self.response.out.write(results[0])

app = webapp2.WSGIApplication([('/', MainHandler),
                               ('/process', ProcessHandler)],
                              debug=True)
I have tried querying with both the http and https protocols. Below is the App Engine log for the request.
Starting new HTTP connection (1): www.google.com
D 2013-12-21 13:13:37.217
"GET /search?q=site:foobar.com%20foo%20bar HTTP/1.1" 302 None
I 2013-12-21 13:13:37.218
Starting new HTTP connection (1): ipv4.google.com
D 2013-12-21 13:13:37.508
"GET /sorry/IndexRedirect?continue=http://www.google.com/search%3Fq%3Dsite:foobar.com%20foo%20bar HTTP/1.1" 403 None
E 2013-12-21 20:51:32.090
list index out of range
Answer 0 (score: 0)
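I'm confused about why you're trying to spoof the User-Agent header, but if it makes you happy, go for it. Note that if requests.get is using urlfetch, App Engine appends a string identifying your app to the User-Agent header your app supplies. (See https://developers.google.com/appengine/docs/python/urlfetch/#Python_Request_headers).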
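To illustrate why a spoofed User-Agent does not hide the app, here is a rough sketch of what the remote server may end up seeing. The suffix format follows the description in the documentation linked above; the helper name effective_user_agent, the app id 'my-app', and the single-space separator are assumptions made for illustration only.

```python
# Suffix format as described in the URL Fetch documentation; {app_id} is a placeholder.
APPENGINE_UA_SUFFIX = 'AppEngine-Google; (+http://code.google.com/appengine; appid: {app_id})'


def effective_user_agent(supplied, app_id):
    """Approximate the User-Agent a remote server sees for a urlfetch request."""
    # Assumption: the suffix is joined to the supplied value with a single space.
    return '{0} {1}'.format(supplied, APPENGINE_UA_SUFFIX.format(app_id=app_id))


# 'my-app' is a hypothetical application id.
print(effective_user_agent(
    'Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0', 'my-app'))
```

Whatever browser string the app supplies, the appended part still marks the request as coming from App Engine.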
Try passing follow_redirects=False to urlfetch. That's how you make requests to other App Engine apps. For reasons that are not at all obvious, it might help in this case.
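A minimal sketch of that suggestion, calling urlfetch directly instead of going through requests. The helper name fetch_without_redirects and its fetch parameter are hypothetical; the parameter exists only so the function can be exercised outside App Engine, and by default it uses google.appengine.api.urlfetch.fetch. Whether this actually avoids the 302 is not guaranteed.

```python
def fetch_without_redirects(url, user_agent, fetch=None):
    """Fetch url once without following redirects.

    Returns a (body, status_code, location) tuple. body is None unless
    the status is 200; location is the redirect target, if any.
    """
    if fetch is None:
        # Only importable inside the App Engine runtime.
        from google.appengine.api import urlfetch
        fetch = urlfetch.fetch
    response = fetch(url,
                     headers={'User-Agent': user_agent},
                     follow_redirects=False)
    if response.status_code != 200:
        # Surface the redirect target instead of parsing an error page.
        return None, response.status_code, response.headers.get('Location')
    return response.content, response.status_code, None
```

In get_result, the caller could then check the status before handing the body to PyQuery, so that a blocked request does not surface later as "list index out of range".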