How to parse Google search results in Python using BeautifulSoup

Date: 2017-12-21 16:03:54

Tags: python python-3.x beautifulsoup lxml

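I am trying to parse the first page of Google search results, specifically the title and the short summary shown for each result. Here is what I have so far: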

{{1}}

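The part I am stuck on is walking down the HTML path to parse out the specific data I want. Everything I have tried so far either throws an error saying the object has no such attribute, or just returns "[]".

I am new to Python and BeautifulSoup, so I am not sure of the syntax for getting to the elements I want. I found that these are the individual search results in the page: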

https://ibb.co/jfRakR

Any help with parsing the title and summary of each search result would be greatly appreciated.

Thanks!

2 answers:

Answer 0 (score: 6)

Your URL does not work for me, but with https://google.com/search?q= I get results.

import urllib.parse
from bs4 import BeautifulSoup
import requests
import webbrowser

text = 'hello world'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

response = requests.get(url)

#with open('output.html', 'wb') as f:
#    f.write(response.content)
#webbrowser.open('output.html')

soup = BeautifulSoup(response.text, 'lxml')
for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')

Read the Beautiful Soup documentation.
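
As a small variation on the URL-building step above, the query string can also be assembled with `urllib.parse.urlencode`, which escapes the search text in the same way as `quote_plus`. This is only a sketch: the helper name `build_search_url` is made up for illustration and is not part of the answer's code.

```python
from urllib.parse import urlencode


def build_search_url(query: str, base: str = "https://google.com/search") -> str:
    """Return a search URL with the query text percent-encoded.

    `build_search_url` is a hypothetical helper, not a library function.
    """
    # urlencode() escapes values with quote_plus by default,
    # so spaces become '+' and reserved characters are percent-encoded.
    return f"{base}?{urlencode({'q': query})}"


print(build_search_url("hello world"))
```

Passing the text through an encoder like this avoids broken URLs when the search contains spaces, `&`, or `+`.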

Answer 1 (score: 1)

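  1. The start of your Google search address is slightly wrong. It should not contain the # symbol; it should use the /search path followed by ?q= instead.
So this ---> https://google.com/#q=
Should be this ---> https://www.google.com/search?q=cake

  2. You need a user-agent for this to work, because the default user-agent of the requests library is python-requests, which sites can recognize and use to block the script. Check robots.txt for more information. This may be why you got empty results. Lists of user-agent strings that can be used to mimic a browser visit are available online.

  3. Alternatively, you can use the Google Organic Results API from SerpApi (see the end of this answer).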
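
To make the first two points concrete, here is a minimal sketch that inspects what requests would send without actually contacting Google. It prepares a request offline and prints the URL and the User-Agent header. The `BROWSER_UA` value is just an illustrative browser-like string, not a recommended one.

```python
import requests

# Illustrative browser-like value (an assumption); any current browser string could be used.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"

session = requests.Session()

# What requests sends when no header is overridden: "python-requests/<version>".
default_ua = session.headers["User-Agent"]
print("default:", default_ua)

# Override the header for every request made through this session.
session.headers.update({"User-Agent": BROWSER_UA})

# Build the request without sending it, to see the final URL and headers.
prepared = session.prepare_request(
    requests.Request("GET", "https://www.google.com/search", params={"q": "cake"})
)
print("url:", prepared.url)
print("sent as:", prepared.headers["User-Agent"])
```

Calling `session.send(prepared)` would perform the actual request; it is left out here so that the sketch runs without network access.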

Code:

from bs4 import BeautifulSoup
import requests
import json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=java&oq=java',
                    headers=headers).text

soup = BeautifulSoup(html, 'lxml')

summary = []

# The class names below match Google's markup at the time of writing and may change.
for container in soup.find_all('div', class_='tF2Cxc'):
    heading = container.find('h3', class_='LC20lb DKV0Md').text
    article_summary = container.find('span', class_='aCOpRe').text

    summary.append({
        'Heading': heading,
        'Article Summary': article_summary,
    })

print(json.dumps(summary, indent=2, ensure_ascii=False))

JSON output:

[
  {
    "Heading": "Java | Oracle",
    "Article Summary": "Java+You, Download Today! Java Download. » What is Java? » Need Help? » Uninstall. About Java. Go Java Java Training Java + Greenfoot Oracle Code One Oracle Academy for ..."
  },
  {
    "Heading": "Oracle Java Technologies | Oracle",
    "Article Summary": "Java Is the Language of Possibilities. Java is powering the innovation behind our digital world. Harness this potential with Java resources for student coders, ..."
  },
  {
    "Heading": "Java Software | Oracle",
    "Article Summary": "includes GraalVM Enterprise at no additional cost. Download Java now · Get support. Products. Oracle Java SE Subscription · Oracle JDK · Oracle OpenJDK · Oracle Java SE Platform ..."
  },
  {
    "Heading": "Java (programming language) - Wikipedia",
    "Article Summary": "Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose ..."
  },
  {
    "Heading": "Java - Wikipedia",
    "Article Summary": "Java (Indonesian: Jawa, Indonesian pronunciation: [ˈdʒawa]; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is one of the islands of the Greater Sunda Islands in Indonesia, ..."
  },
  {
    "Heading": "Google LLC v. Oracle America, Inc. - Supreme Court",
    "Article Summary": "2 days ago — the Java programming language to work with its new Android plat- form, Google copied roughly 11,500 lines of code from the Java SE pro-."
  },
  {
    "Heading": "OpenJDK - Java.net",
    "Article Summary": "ZGC. Tools. Mercurial · Git · jtreg harness. Related. java.sun.com · Java Community Process · JDK GA/EA Builds · Oracle logo. © 2021 Oracle Corporation and/or its affiliates. Terms of ..."
  }
]
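
The question mentions errors about a missing attribute, which typically happen when `find()` returns `None` and `.text` is then accessed on it. The sketch below shows one defensive way to handle that. It runs on a small hand-written HTML sample rather than a live Google page, and the helper name `extract_results` is hypothetical. The class names are copied from the code above and may no longer match Google's markup.

```python
from bs4 import BeautifulSoup

# Hand-written sample (an assumption) that imitates the structure parsed above.
SAMPLE_HTML = """
<div class="tF2Cxc"><h3 class="LC20lb DKV0Md">Java | Oracle</h3>
  <span class="aCOpRe">Java Download.</span></div>
<div class="tF2Cxc"><h3 class="LC20lb DKV0Md">No snippet here</h3></div>
"""


def extract_results(html):
    """Collect heading/summary pairs, tolerating missing elements."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for container in soup.select("div.tF2Cxc"):
        heading = container.select_one("h3")
        snippet = container.select_one("span.aCOpRe")
        if heading is None:
            # Skip containers without a title instead of raising AttributeError.
            continue
        results.append({
            "Heading": heading.get_text(strip=True),
            "Article Summary": snippet.get_text(strip=True) if snippet else None,
        })
    return results


for item in extract_results(SAMPLE_HTML):
    print(item)
```

Checking for `None` before reading the text keeps one malformed result from aborting the whole loop.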

Using SerpApi

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "java",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Title: {result['title']}\nSummary: {result['snippet']}\n")

Output:

Title: Java | Oracle
Summary: Java Download. » What is Java? » Need Help? » Uninstall. About Java. Go Java Java Training Java + Greenfoot Oracle Code One Oracle Academy for ...

Title: Oracle Java Technologies | Oracle
Summary: Java Is the Language of Possibilities. Java is powering the innovation behind our digital world. Harness this potential with Java resources for student coders, ...

Title: Java SE - Downloads | Oracle Technology Network | Oracle
Summary: Java SE downloads including: Java Development Kit (JDK), Server Java Runtime Environment (Server JRE), and Java Runtime Environment (JRE).

Title: Java (programming language) - Wikipedia
Summary: Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose ...

Title: Java - Wikipedia
Summary: Java (Indonesian: Jawa, Indonesian pronunciation: [ˈdʒawa]; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is one of the islands of the Greater Sunda Islands in Indonesia, ...

Title: OpenJDK - Java.net
Summary: What is this? The place to collaborate on an open-source implementation of the Java Platform, Standard Edition, and related projects. (Learn more.).

Title: Java Resources for Students, Hobbyists and More | go.Java ...
Summary: Java Powers Our Digital World. Java is at the heart of our digital lifestyle. It's the platform for launching careers, exploring human-to-digital interfaces, architecting ...

Make sure you have created an environment variable that holds your api_key.
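
Because `os.getenv("API_KEY")` silently returns `None` when the variable is missing, it can help to fail early with a clear message. A minimal sketch, assuming the variable is named `API_KEY` as in the code above; `load_api_key` is a hypothetical helper and not part of the SerpApi client.

```python
import os


def load_api_key(name="API_KEY", env=None):
    """Return the API key from the environment or raise a clear error.

    `load_api_key` is a hypothetical helper; `env` defaults to os.environ
    and can be replaced by a plain dict for testing.
    """
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key


# Demonstration with a stand-in mapping instead of the real environment.
print(load_api_key(env={"API_KEY": "example-key"}))
```

In the script above, `"api_key": load_api_key()` could then replace the direct `os.getenv` call.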

Disclaimer: I work for SerpApi.