Getting the first 10 Google search results using the Google API

Date: 2010-12-14 16:54:56

Tags: python json google-api

I need to get the first 10 Google search results.

For example:

import urllib
import simplejson

query = urllib.urlencode({'q': 'example'})
url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
search_results = urllib.urlopen(url)
json = simplejson.loads(search_results.read())
results = json['responseData']['results']

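This gives me the first page of results, but I would like to get more Google search results. Does anyone know how to do that?

4 answers:

Answer 0 (score: 3)

I did this a while ago; here is a complete example (I'm no Python guru, but it works):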

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys, getopt
import urllib
import simplejson

OPTIONS = ("m:", ["min="])

def print_usage():
    s = "usage: " + sys.argv[0] + " "
    for o in OPTIONS[0]:
        if o != ":" : s += "[-" + o + "] "
    print(s + "query_string\n")

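def search(query, index, offset, min_count, quiet=False, rs=None):
    # Use a fresh list per top-level call; a mutable default would be shared across calls.
    if rs is None:
        rs = []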
    url = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&rsz=large&%s&start=%s" % (query, offset)
    result = urllib.urlopen(url)
    json = simplejson.loads(result.read())
    status = json["responseStatus"]
    if status == 200:
        results = json["responseData"]["results"]
        cursor = json["responseData"]["cursor"]
        pages = cursor["pages"]
        for r in results:
            i = results.index(r) + (index -1) * len(results) + 1
            u = r["unescapedUrl"]
            rs.append(u)
            if not quiet:
                print("%3d. %s" % (i, u))
        next_index  = None
        next_offset = None
        for p in pages:
            if p["label"] == index:
                i = pages.index(p)
                if i < len(pages) - 1:
                    next_index  = pages[i+1]["label"]
                    next_offset = pages[i+1]["start"]
                break
        if next_index != None and next_offset != None:
            if int(next_offset) < min_count:
                search(query, next_index, next_offset, min_count, quiet, rs)
    return rs

def main():
    min_count = 64
    try:
        opts, args = getopt.getopt(sys.argv[1:], *OPTIONS)
        for opt, arg in opts:
            if opt in ("-m", "--min"):
                min_count = int(arg)
        assert len(args) > 0
    except (getopt.GetoptError, ValueError, AssertionError):
        print_usage()
        sys.exit(1)
    qs = " ".join(args)
    query = urllib.urlencode({"q" : qs})
    search(query, 1, "0", min_count)

if __name__ == "__main__":
    main()

Edit: I fixed the obviously broken handling of command-line options; you can call this script as follows:

python gsearch.py --min=5 vanessa mae

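The --min switch means "at least 5 results" and is optional; if it is not specified, you get the maximum allowed result count (64).

Also, error handling has been omitted for brevity.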
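The error handling left out above might look roughly like this. It is only a sketch, written for Python 3, and it assumes the response has the shape the script already relies on (responseStatus, responseData, results, unescapedUrl). The responseDetails field and the names SearchError and extract_urls are assumptions made for illustration; they are not part of the original script.

```python
class SearchError(Exception):
    """Raised when a response carries no usable results (hypothetical helper)."""


def extract_urls(payload):
    """Return the result URLs from a decoded JSON response, or raise SearchError."""
    status = payload.get("responseStatus")
    if status != 200:
        # "responseDetails" is assumed to hold a human-readable error message.
        details = payload.get("responseDetails") or "no details given"
        raise SearchError("search failed with status %s: %s" % (status, details))
    data = payload.get("responseData") or {}
    # Skip malformed entries instead of failing on a missing key.
    return [r["unescapedUrl"] for r in data.get("results", []) if "unescapedUrl" in r]


if __name__ == "__main__":
    ok = {"responseStatus": 200,
          "responseData": {"results": [{"unescapedUrl": "http://example.com/"}]}}
    print(extract_urls(ok))
```

Raising instead of silently returning an empty list lets the caller tell "no results" apart from "the request was rejected".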

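Answer 1 (score: 2)

This is an old question, but it is still relevant.

Two ways:

  1. The beautifulsoup and requests Python libraries.
  2. The Google Search Engine Results API from SerpApi.

The first approach, using the Python libraries (this code is taken from another answer of mine):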

from bs4 import BeautifulSoup
import requests
import json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=java&oq=java',
                    headers=headers).text

soup = BeautifulSoup(html, 'lxml')

summary = []

for container in soup.find_all('div', class_='tF2Cxc'):
    heading = container.find('h3', class_='LC20lb DKV0Md').text
    article_summary = container.find('span', class_='aCOpRe').text
    link = container.find('a')['href']

    summary.append({
        'Heading': heading,
        'Article Summary': article_summary,
        'Link': link,
    })

print(json.dumps(summary, indent=2, ensure_ascii=False))

Output JSON:

[
  {
    "Heading": "Java | Oracle",
    "Article Summary": "Java Download. » What is Java? » Need Help? » Uninstall. About Java. Go Java Java Training Java + Greenfoot Oracle Code One Oracle Academy for ...",
    "Link": "https://www.java.com/"
  },
  {
    "Heading": "Oracle Java Technologies | Oracle",
    "Article Summary": "Java Is the Language of Possibilities. Java is powering the innovation behind our digital world. Harness this potential with Java resources for student coders, ...",
    "Link": "https://www.oracle.com/java/technologies/"
  },
  {
    "Heading": "Java Software | Oracle",
    "Article Summary": "Oracle Java. Java is the #1 programming language and development platform. It reduces costs, shortens development timeframes, drives innovation, and ...",
    "Link": "https://www.oracle.com/java/"
  },
  {
    "Heading": "Java - Wikipedia",
    "Article Summary": "Java (Indonesian: Jawa, Indonesian pronunciation: [ˈdʒawa]; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is an island of Indonesia, bordered by the Indian Ocean to the ...",
    "Link": "https://en.wikipedia.org/wiki/Java"
  },
  {
    "Heading": "Java (programming language) - Wikipedia",
    "Article Summary": "Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose ...",
    "Link": "https://en.wikipedia.org/wiki/Java_(programming_language)"
  },
  {
    "Heading": "JDK Builds from Oracle - Java.net",
    "Article Summary": "Java Development Kit builds, from Oracle. Ready for use: JDK 16, JDK 15, JMC 8. Early access: JDK 17, Lanai, Loom, Metropolis, Panama, & Valhalla.",
    "Link": "https://jdk.java.net/"
  },
  {
    "Heading": "Java Tutorial - Tutorialspoint",
    "Article Summary": "Java is a high-level programming language originally developed by Sun Microsystems and released in 1995. Java runs on a variety of platforms, such as ...",
    "Link": "https://www.tutorialspoint.com/java/index.htm"
  },
  {
    "Heading": "Google Java Style Guide",
    "Article Summary": "This document serves as the complete definition of Google's coding standards for source code in the Java™ Programming Language. A Java source file is ...",
    "Link": "https://google.github.io/styleguide/javaguide.html"
  }
]

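The second approach uses the Google Search Engine Results API from SerpApi. It is a paid API with a free trial of 5,000 searches. (If you want extra protection for your API key, such as passing os.getenv("API_KEY") as api_key, make sure the API_KEY environment variable has been set, for example through os.environ["API_KEY"]; see the documentation.)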

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "java",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
   print(f"Title: {result['title']}\nLink: {result['link']}\n")

Output:

Title: Java | Oracle
Link: https://www.java.com/

Title: Oracle Java Technologies | Oracle
Link: https://www.oracle.com/java/technologies/

Title: Java Software | Oracle
Link: https://www.oracle.com/java/

Title: Java - Wikipedia
Link: https://en.wikipedia.org/wiki/Java

Title: Java (programming language) - Wikipedia
Link: https://en.wikipedia.org/wiki/Java_(programming_language)

Title: JDK Builds from Oracle - Java.net
Link: https://jdk.java.net/

Title: Google Java Style Guide
Link: https://google.github.io/styleguide/javaguide.html

Title: Java Tutorial - Tutorialspoint
Link: https://www.tutorialspoint.com/java/index.htm

Also, a bit of a shameless plug:

How to scrape Google News using Python: I answered that here.

How to scrape Google Maps: I answered that here.

Disclaimer: I work for SerpApi.

Answer 2 (score: 1)

Google now marks this method as deprecated; you may want to try this instead:

http://code.google.com/apis/customsearch/v1/overview.html
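
For reference, a request to the Custom Search JSON API can be put together roughly as follows. This is a sketch in Python 3, not a tested client: the endpoint and the key, cx, q, num, and start parameters are as I understand the API and should be checked against the documentation linked above, and build_custom_search_url, MY_KEY, and MY_ENGINE_ID are placeholder names.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API (assumed; verify against the docs).
ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_custom_search_url(api_key, engine_id, query, num=10, start=1):
    """Build a request URL for `num` results beginning at 1-based index `start`."""
    if not 1 <= num <= 10:
        # The API is assumed to return at most 10 results per request.
        raise ValueError("num must be between 1 and 10")
    params = {"key": api_key, "cx": engine_id, "q": query, "num": num, "start": start}
    return ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_custom_search_url("MY_KEY", "MY_ENGINE_ID", "example"))
```

If the parameters behave as assumed, a single request with num=10 covers the first 10 results, and later pages are reached by raising start.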

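Answer 3 (score: 0)

See the documentation: http://code.google.com/apis/websearch/docs/reference.html#_intro_fonje

You are looking for the start parameter.

There is no parameter that returns more results in a single response, but you can iterate using the start parameter.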
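
Iterating with start could be sketched like this. The code is Python 3 and only builds the request URLs; it assumes that rsz=large (as used in the first answer) yields 8 results per response, and page_urls is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "http://ajax.googleapis.com/ajax/services/search/web"


def page_urls(query, wanted=10, page_size=8):
    """Return request URLs that together cover at least `wanted` results.

    page_size=8 is assumed to match what rsz=large returns per response.
    """
    urls = []
    for start in range(0, wanted, page_size):
        params = urlencode({"v": "1.0", "rsz": "large", "q": query, "start": start})
        urls.append(BASE_URL + "?" + params)
    return urls


if __name__ == "__main__":
    for url in page_urls("example"):
        print(url)
```

Each URL is then fetched and decoded as in the question, the results lists are concatenated, and the combined list is cut down to the first 10 entries.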