Error when doing a Google search with Python: 503 Service Unavailable

Time: 2017-04-26 09:47:26

Tags: python python-3.x google-search

When I try to run the following in the Python console:

from google import search
urls = search("site:facebook.com inurl:login", stop=20)
for url in urls:
    print(url)

in order to search for login pages, I get this error:

urllib.error.HTTPError: HTTP Error 503: Service Unavailable

However, if I search for it manually on Google it works. Could Google be blocking my queries?

2 Answers:

Answer 0: (score: 3)

As Cong Ma says in his answer, too many automated searches on Google will get you blocked, and you will receive the 503 error. The only Google API that currently works is the Google Custom Search API, but the problem is that it is designed to search your own site. It can be configured to search the whole web instead, but even then you only get 100 free searches per day (a minimal sketch of that API call is included after the example below). There used to be other options such as the Bing and Yahoo APIs, but neither of them is free anymore; the only remaining free API for web search is the FAROO API. Another option for Google searches is Selenium. Selenium is used to imitate browser usage: the selenium webdriver can drive Firefox, Chrome, Edge or Safari (it will actually open the browser and perform the search), which is annoying because you usually don't want to see the browser window. The solution is to use PhantomJS, a headless browser. Download it, extract it, and see how it is used in the example below (I wrote a simple class you can use; you only need to change the path to PhantomJS):

import time
from urllib.parse import quote_plus
from selenium import webdriver


class Browser:

    def __init__(self, path, initiate=True, implicit_wait_time = 10, explicit_wait_time = 2):
        self.path = path
        self.implicit_wait_time = implicit_wait_time    # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        self.explicit_wait_time = explicit_wait_time    # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        if initiate:
            self.start()
        return

    def start(self):
        self.driver = webdriver.PhantomJS(self.path)  # use the path passed to the constructor
        self.driver.implicitly_wait(self.implicit_wait_time)
        return

    def end(self):
        self.driver.quit()
        return

    def go_to_url(self, url, wait_time = None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        self.driver.get(url)
        print('[*] Fetching results from: {}'.format(url))
        time.sleep(wait_time)
        return

    def get_search_url(self, query, page_num=0, per_page=10, lang='en'):
        query = quote_plus(query)
        url = 'https://www.google.hr/search?q={}&num={}&start={}&hl={}'.format(query, per_page, page_num*per_page, lang)  # hl sets the interface language
        return url

    def scrape(self):
        # the xpath might change in the future
        links = self.driver.find_elements_by_xpath("//h3[@class='r']/a[@href]")  # searches for all links inside h3 tags with class "r"
        results = []
        for link in links:
            d = {'url': link.get_attribute('href'),
                 'title': link.text}
            results.append(d)
        return results

    def search(self, query, page_num=0, per_page=10, lang='en', wait_time = None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        url = self.get_search_url(query, page_num, per_page, lang)
        self.go_to_url(url, wait_time)
        results = self.scrape()
        return results




path = '<YOUR PATH TO PHANTOMJS>/phantomjs-2.1.1-windows/bin/phantomjs.exe'  ## SET YOUR PATH TO phantomjs
br = Browser(path)
results = br.search('site:facebook.com inurl:login')
for r in results:
    print(r)

br.end()
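
For completeness, here is a minimal sketch of the Google Custom Search API option mentioned above. It assumes you have already created a Custom Search Engine and obtained an API key; the api_key and cse_id values below are hypothetical placeholders, and the free tier is limited to 100 queries per day:

import json
from urllib.parse import urlencode
from urllib.request import urlopen


def custom_search(query, api_key, cse_id, num=10):
    # Call the Custom Search JSON API and return a list of result URLs.
    params = urlencode({'key': api_key, 'cx': cse_id, 'q': query, 'num': num})
    url = 'https://www.googleapis.com/customsearch/v1?' + params
    with urlopen(url) as response:
        data = json.loads(response.read().decode('utf-8'))
    return [item['link'] for item in data.get('items', [])]


api_key = '<YOUR API KEY>'            # placeholder
cse_id = '<YOUR SEARCH ENGINE ID>'    # placeholder
for link in custom_search('site:facebook.com inurl:login', api_key, cse_id):
    print(link)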

Answer 1: (score: 1)

Google does try to block "unexpected" queries, i.e. queries that do not come from a normal browser UI; in the ordinary browser UI it presents a captcha instead. It takes into account traffic patterns (use of "smart" queries, IP blocks known to be used by spammers) as well as client behavior such as searching too fast.

You can inspect the details of the error by catching it:

import urllib.error
from google import search

try:
    urls = search("site:facebook.com inurl:login", stop=20)
except urllib.error.HTTPError as httperr:
    print(httperr.headers)  # dump the headers to see if there is more information
    print(httperr.read())   # the error object can be read just like a normal response file
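
Since request rate is one of the signals Google looks at, slowing down the queries sometimes helps. A minimal sketch, assuming the installed google package version accepts a pause argument (the delay in seconds between consecutive HTTP requests):

from google import search

# Increase the pause between requests to reduce the chance of being rate-limited.
urls = search("site:facebook.com inurl:login", stop=20, pause=10.0)
for url in urls:
    print(url)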