I need to get the first 10 Google search results.
For example:
import urllib
import simplejson

query = urllib.urlencode({'q': 'example'})
url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
search_results = urllib.urlopen(url)
json = simplejson.loads(search_results.read())
results = json['responseData']['results']
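This gives me the first page of results, but I want to get more Google search results. Does anyone know how to do that?
Answer 0 (score: 3)
I did this in the past. Here is a complete example (I am no Python guru, but it works):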
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script (uses urllib.urlopen and urllib.urlencode).
import sys, getopt
import urllib
import simplejson

OPTIONS = ("m:", ["min="])

def print_usage():
    s = "usage: " + sys.argv[0] + " "
    for o in OPTIONS[0]:
        if o != ":" : s += "[-" + o + "] "
    print(s + "query_string\n")

def search(query, index, offset, min_count, quiet=False, rs=None):
    # Create the result list here rather than as a shared mutable default.
    if rs is None:
        rs = []
    url = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&rsz=large&%s&start=%s" % (query, offset)
    result = urllib.urlopen(url)
    json = simplejson.loads(result.read())
    status = json["responseStatus"]
    if status == 200:
        results = json["responseData"]["results"]
        cursor = json["responseData"]["cursor"]
        pages = cursor["pages"]
        for r in results:
            # Running result number across pages.
            i = results.index(r) + (index -1) * len(results) + 1
            u = r["unescapedUrl"]
            rs.append(u)
            if not quiet:
                print("%3d. %s" % (i, u))
        # Find the page that follows the current one, if any.
        next_index = None
        next_offset = None
        for p in pages:
            if p["label"] == index:
                i = pages.index(p)
                if i < len(pages) - 1:
                    next_index = pages[i+1]["label"]
                    next_offset = pages[i+1]["start"]
                break
        if next_index is not None and next_offset is not None:
            # Keep fetching pages until min_count results are covered.
            if int(next_offset) < min_count:
                search(query, next_index, next_offset, min_count, quiet, rs)
    return rs

def main():
    min_count = 64
    try:
        opts, args = getopt.getopt(sys.argv[1:], *OPTIONS)
        for opt, arg in opts:
            if opt in ("-m", "--min"):
                min_count = int(arg)
        assert len(args) > 0
    except (getopt.GetoptError, ValueError, AssertionError):
        print_usage()
        sys.exit(1)
    qs = " ".join(args)
    query = urllib.urlencode({"q" : qs})
    search(query, 1, "0", min_count)

if __name__ == "__main__":
    main()
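Edit: I fixed the obvious error handling for the command-line options; you can call this script as follows:
python gsearch.py --min=5 vanessa mae
The --min switch means "at least 5 results" and is optional. If it is not specified, you get the maximum allowed result count (64).
Also, error handling has been omitted for brevity.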
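The paging rule in the script above can be looked at in isolation. The following is a minimal sketch, not the author's code: next_page and offsets_to_fetch are hypothetical helpers, and sample_pages is made-up data shaped the way the script reads cursor["pages"] (an integer label and a string start per page). No request is sent.

```python
def next_page(pages, current_label):
    """Return (label, start) of the page after current_label, or None."""
    for i, page in enumerate(pages):
        if page["label"] == current_label:
            if i + 1 < len(pages):
                nxt = pages[i + 1]
                return nxt["label"], nxt["start"]
            return None
    return None


def offsets_to_fetch(pages, min_count):
    """List the start offsets that would be requested, in order."""
    if not pages:
        return []
    offsets = [pages[0]["start"]]
    label = pages[0]["label"]
    while True:
        step = next_page(pages, label)
        # Stop when there is no further page or enough results are covered.
        if step is None or int(step[1]) >= min_count:
            break
        label, start = step
        offsets.append(start)
    return offsets


# Hypothetical cursor data: 8 pages of 8 results each.
sample_pages = [{"label": n + 1, "start": str(n * 8)} for n in range(8)]
print(offsets_to_fetch(sample_pages, 20))  # prints ['0', '8', '16']
```

With min_count set to 20, three pages are requested because the fourth page starts at offset 24, which is already past the requested minimum.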
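Answer 1 (score: 2)
This is an old question, but it is still relevant.
There are two ways. The first uses the beautifulsoup and requests Python libraries (this code is taken from another answer of mine):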
from bs4 import BeautifulSoup
import requests
import json

# Send a browser-like User-agent so the request is not served a stripped page.
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=java&oq=java',
                    headers=headers).text
soup = BeautifulSoup(html, 'lxml')  # requires the lxml parser to be installed

summary = []

# The CSS class names below come from Google's markup at the time of writing
# and will break if that markup changes.
for container in soup.find_all('div', class_='tF2Cxc'):
    heading = container.find('h3', class_='LC20lb DKV0Md').text
    article_summary = container.find('span', class_='aCOpRe').text
    link = container.find('a')['href']

    summary.append({
        'Heading': heading,
        'Article Summary': article_summary,
        'Link': link,
    })

print(json.dumps(summary, indent=2, ensure_ascii=False))
JSON output:
[
  {
    "Heading": "Java | Oracle",
    "Article Summary": "Java Download. » What is Java? » Need Help? » Uninstall. About Java. Go Java Java Training Java + Greenfoot Oracle Code One Oracle Academy for ...",
    "Link": "https://www.java.com/"
  },
  {
    "Heading": "Oracle Java Technologies | Oracle",
    "Article Summary": "Java Is the Language of Possibilities. Java is powering the innovation behind our digital world. Harness this potential with Java resources for student coders, ...",
    "Link": "https://www.oracle.com/java/technologies/"
  },
  {
    "Heading": "Java Software | Oracle",
    "Article Summary": "Oracle Java. Java is the #1 programming language and development platform. It reduces costs, shortens development timeframes, drives innovation, and ...",
    "Link": "https://www.oracle.com/java/"
  },
  {
    "Heading": "Java - Wikipedia",
    "Article Summary": "Java (Indonesian: Jawa, Indonesian pronunciation: [ˈdʒawa]; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is an island of Indonesia, bordered by the Indian Ocean to the ...",
    "Link": "https://en.wikipedia.org/wiki/Java"
  },
  {
    "Heading": "Java (programming language) - Wikipedia",
    "Article Summary": "Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose ...",
    "Link": "https://en.wikipedia.org/wiki/Java_(programming_language)"
  },
  {
    "Heading": "JDK Builds from Oracle - Java.net",
    "Article Summary": "Java Development Kit builds, from Oracle. Ready for use: JDK 16, JDK 15, JMC 8. Early access: JDK 17, Lanai, Loom, Metropolis, Panama, & Valhalla.",
    "Link": "https://jdk.java.net/"
  },
  {
    "Heading": "Java Tutorial - Tutorialspoint",
    "Article Summary": "Java is a high-level programming language originally developed by Sun Microsystems and released in 1995. Java runs on a variety of platforms, such as ...",
    "Link": "https://www.tutorialspoint.com/java/index.htm"
  },
  {
    "Heading": "Google Java Style Guide",
    "Article Summary": "This document serves as the complete definition of Google's coding standards for source code in the Java™ Programming Language. A Java source file is ...",
    "Link": "https://google.github.io/styleguide/javaguide.html"
  }
]
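The second way uses the Google Search Engine Results API from SerpApi. It is a paid API with a free trial of 5,000 searches. If you want extra protection for the API key, do not hardcode api_key in the script; read it from the environment with os.getenv("API_KEY") or os.environ["API_KEY"] instead (see the SerpApi documentation for this and other options):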
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "java",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Title: {result['title']}\nLink: {result['link']}\n")
Output:
Title: Java | Oracle
Link: https://www.java.com/
Title: Oracle Java Technologies | Oracle
Link: https://www.oracle.com/java/technologies/
Title: Java Software | Oracle
Link: https://www.oracle.com/java/
Title: Java - Wikipedia
Link: https://en.wikipedia.org/wiki/Java
Title: Java (programming language) - Wikipedia
Link: https://en.wikipedia.org/wiki/Java_(programming_language)
Title: JDK Builds from Oracle - Java.net
Link: https://jdk.java.net/
Title: Google Java Style Guide
Link: https://google.github.io/styleguide/javaguide.html
Title: Java Tutorial - Tutorialspoint
Link: https://www.tutorialspoint.com/java/index.htm
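Since the question asks for more than one page of results, the same idea could be extended to the SerpApi example by requesting several pages. The sketch below only builds the parameter dictionaries and sends nothing. It assumes the API accepts a "start" result offset and that a page holds 10 results; both should be checked against the SerpApi documentation. paged_params is a hypothetical helper, not part of the serpapi package.

```python
import os


def paged_params(query, pages, page_size=10, api_key=None):
    """Build one parameter dict per result page.

    The "start" key and the page size of 10 are assumptions.
    """
    key = api_key if api_key is not None else os.getenv("API_KEY")
    return [
        {
            "engine": "google",
            "q": query,
            "start": page * page_size,  # offset of the first result on the page
            "api_key": key,
        }
        for page in range(pages)
    ]


if __name__ == "__main__":
    for params in paged_params("java", 3, api_key="demo-key"):
        print(params["start"])
```

Each dictionary would then be passed to GoogleSearch in the same way as the single params dictionary above, collecting organic_results from every response.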
Also, a bit of a shameless plug:
How to scrape Google News with Python: I answered that here.
How to scrape Google Maps: I answered that here.
Disclaimer: I work for SerpApi.
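Answer 2 (score: 1)
Google now marks this method as deprecated; you may want to try:
Answer 3 (score: 0)
See the documentation at http://code.google.com/apis/websearch/docs/reference.html#_intro_fonje
You are looking for the start parameter.
There is no parameter that returns more results in a single response, but you can iterate using the start parameter.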
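Iterating over start could look roughly like this. It is a minimal Python 3 sketch that only builds the request URLs and does not contact the service, which is deprecated. The endpoint is the one from the question; build_page_urls is a hypothetical helper, and the page size of 8 is an assumption matching rsz=large.

```python
from urllib.parse import urlencode

# Endpoint taken from the question; the service itself is deprecated.
BASE_URL = "http://ajax.googleapis.com/ajax/services/search/web"


def build_page_urls(query, total, page_size=8):
    """Return one request URL per page, stepping `start` by page_size.

    page_size=8 is an assumption matching rsz=large.
    """
    urls = []
    for start in range(0, total, page_size):
        params = [("v", "1.0"), ("rsz", "large"), ("q", query), ("start", start)]
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    for url in build_page_urls("vanessa mae", 20):
        print(url)
```

Asking for 20 results yields three URLs with start set to 0, 8 and 16; each response would then be parsed as in the question and the result lists concatenated.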