Question

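I need to get text data from the info panel of the Google search engine. If someone searches Google with the keyword "siemens", a small info panel appears on the right side of the search results. I want to collect some of the text in that panel. How can I do this with requests and BeautifulSoup? Here is some code I wrote.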

from bs4 import BeautifulSoup as BS
import requests
from googlesearch import search
from googleapiclient.discovery import build

url = 'https://www.google.com/search?ei=j-iKXNDxDMPdwALdwofACg&q='


com = 'siemens'

#for url in search(com, tld='de', lang='de', stop=10):
#    print(url)

response = requests.get(url+com)
soup = BS(response.content, 'html.parser')

The area marked in red is the info panel.

Answer 1

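You can use BeautifulSoup's find function to retrieve an element by tag name, class name, or id. CSS selectors are handled by select() and select_one(); BeautifulSoup itself does not support XPath. If you inspect the info panel (right-click it and choose "Inspect"), you can find a class name or id that is unique to the panel. Use it to filter the info panel out of the full HTML that BeautifulSoup parsed.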

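Look at find() and find_all() in BeautifulSoup to get the output. Because an id should be unique to a single HTML element, always try looking up by id first. If there is no id, fall back to the other options.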
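As a rough illustration of the lookups described above, here is a minimal sketch on a small static HTML string. The markup, the id `info-panel`, and the class names are invented for the example and are not what Google actually serves.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a results page; the ids and class
# names are made up for illustration only.
HTML = """
<div id="main">
  <div id="info-panel" class="panel">
    <h2 class="title">Siemens</h2>
    <span class="subtitle">Automation company</span>
  </div>
  <div class="result">Result one</div>
  <div class="result">Result two</div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() returns the first match; looking up by id is the most specific option.
panel = soup.find(id="info-panel")

# Inside the panel, search by tag name plus class, or with a CSS selector.
panel_title = panel.find("h2", class_="title").get_text(strip=True)
panel_subtitle = panel.select_one("span.subtitle").get_text(strip=True)

# find_all() returns every match as a list.
results = [div.get_text(strip=True) for div in soup.find_all("div", class_="result")]

print(panel_title, panel_subtitle, results)
```

On a real page the same calls apply, but the id or class has to be read from the browser's inspector first.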

To get the URL, use google.com/search?q=[] with your search term in place of []. For queries of more than one word, put a "+" between the words.
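Instead of joining the words with "+" by hand, the standard library can encode the query. The sketch below assumes only the `q` parameter mentioned above; the helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(query: str) -> str:
    """Build a Google search URL; urlencode turns spaces into '+'."""
    return "https://www.google.com/search?" + urlencode({"q": query})


single = build_search_url("siemens")
multi = build_search_url("siemens ag munich")

print(single)
print(multi)
```

With requests, passing `params={"q": query}` to `requests.get` does the same encoding.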

Answer 2

Make sure you send a user-agent header to imitate a real user visit; otherwise Google may block your request. List of user-agents.

To pick elements from the page visually, you can use the SelectorGadget Chrome extension to get CSS selectors.

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

# Browser-like User-Agent so the request looks like a real user visit
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?q=siemens', headers=headers).text
soup = BeautifulSoup(response, 'lxml')

# CSS selectors for the info panel fields (Google's class names may change over time)
title = soup.select_one('.SPZz6b h2').text
subtitle = soup.select_one('.wwUB2c span').text
website = soup.select_one('.ellip .ellip').text
snippet = soup.select_one('.Uo8X3b+ span').text
print(f'{title}\n{subtitle}\n{website}\n{snippet}')

Output:

Siemens
Automation company
siemens.com
Siemens AG is a German multinational conglomerate company headquartered in Munich and the largest industrial manufacturing company in Europe with branch offices abroad.
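Note that select_one() returns None when nothing matches, so calling .text on the result raises AttributeError if Google changes a class name or the panel is missing. A minimal defensive sketch follows; the helper `text_or_none` and the sample HTML are assumptions for illustration, not part of the answer above.

```python
from typing import Optional

from bs4 import BeautifulSoup


def text_or_none(soup: BeautifulSoup, selector: str) -> Optional[str]:
    """Return the stripped text of the first match, or None if nothing matches."""
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


# Hypothetical fragment containing only one of the selectors used above.
HTML = '<div class="SPZz6b"><h2><span>Siemens</span></h2></div>'
soup = BeautifulSoup(HTML, "html.parser")

found = text_or_none(soup, ".SPZz6b h2")      # selector matches
missing = text_or_none(soup, ".wwUB2c span")  # selector does not match

print(found, missing)
```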

Alternatively, you can use the Google Search Engine Results API from SerpApi. It is a paid API with a free trial of 5,000 searches.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "siemens",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

title = results["knowledge_graph"]["title"]
subtitle = results["knowledge_graph"]["type"]
website = results["knowledge_graph"]["website"]
snippet = results["knowledge_graph"]["description"]
print(f'{title}\n{subtitle}\n{website}\n{snippet}')

Output:

Siemens
Automation company
http://www.siemens.com/
Siemens AG is a German multinational conglomerate company headquartered in Munich and the largest industrial manufacturing company in Europe with branch offices abroad.
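Not every query produces an info panel, so the "knowledge_graph" key may be absent from the response. The sketch below reads the same four fields with dict.get so that missing data yields None instead of a KeyError; the helper `extract_panel` and the sample dictionary are hypothetical and stand in for the output of search.get_dict().

```python
from typing import Optional


def extract_panel(results: dict) -> dict:
    """Pick the info panel fields out of a results dict, tolerating gaps."""
    knowledge_graph = results.get("knowledge_graph") or {}
    fields = ("title", "type", "website", "description")
    return {field: knowledge_graph.get(field) for field in fields}


# Hypothetical sample data, not a real API response.
sample = {"knowledge_graph": {"title": "Siemens", "type": "Automation company"}}

panel = extract_panel(sample)
empty = extract_panel({})

print(panel)
print(empty)
```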

Disclaimer: I work for SerpApi.

How to get text data from the Google search info panel
