Extracting a specific paragraph when scraping Google search results

Time: 2021-06-28 21:50:47

Tags: python beautifulsoup python-requests screen-scraping google-search

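I am currently working on a web scraping project, and I need to extract the description of a city from Google search results.

Say I want the description of the city of Madrid. I search for it and get the following result: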

I need to extract the highlighted text

This is the source code of the target div:

<div jscontroller="GCSbhd" class="kno-rdesc" jsaction="seM7Qe:c0XUbe;Iigoee:c0XUbe;rcuQ6b:npT2md">
    <h3 class="Uo8X3b OhScic zsYMMe">Description</h3>
    <span>Située au centre de l'Espagne, Madrid, sa capitale, est une ville dotée d'élégants boulevards et de vastes parcs très bien entretenus comme le Retiro. Elle est réputée pour ses riches collections d'œuvres d'art européennes, avec notamment celles du musée du Prado, réalisées par Goya, Velázquez et d'autres maîtres espagnols. Au cœur de la vieille Madrid des Habsbourgs se trouve la Plaza&nbsp;Mayor, bordée de portiques, et, à proximité, le Palais royal baroque et son Armurerie, qui comporte des armes historiques.
        <span>
            <span class="eHaQD"> ―&nbsp;Google
            </span>
        </span>
    </span>
</div>

I tried scraping the page by selecting the <h3> tag so that I could then take its sibling, but the result is None. This is the code I used:

import requests
from bs4 import BeautifulSoup
url_PresMadrid = "https://www.google.com/search?q=madrid"
req_PresPadrid = requests.get(url_PresMadrid)
soup_PresMadrid = BeautifulSoup(req_PresPadrid.content, 'html.parser')
target_div_PresMadrid = soup_PresMadrid.find('h3', {'class': 'Uo8X3b OhScic zsYMMe'})
print(target_div_PresMadrid)

I even tried selecting the parent <div>, the only element whose class does not change, but that code returns None as well. Here it is:

import requests
from bs4 import BeautifulSoup
url_PresMadrid = "https://www.google.com/search?q=madrid"
req_PresPadrid = requests.get(url_PresMadrid)
soup_PresMadrid = BeautifulSoup(req_PresPadrid.content, 'html.parser')
target_div_PresMadrid = soup_PresMadrid.find('div', {'class': 'liYKde g VjDLd'})
print(target_div_PresMadrid)

Can anyone help me understand how the search engine serves this page, so that I can extract that paragraph?

1 answer:

Answer 0: (score: 1)

If you disable JavaScript in your browser, you will see that the paragraph you want is actually under the class BNeawe s3v9rd AP7Wnd:

<div class="BNeawe s3v9rd AP7Wnd">
 Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro. It’s renowned for its rich repositories of European art, including the Prado Museum’s works by Goya, Velázquez and other Spanish masters. The heart of old Hapsburg Madrid is the portico-lined Plaza Mayor, and nearby is the baroque Royal Palace and Armory, displaying historic weaponry.
</div>

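The requests library does not execute JavaScript, so the HTML it receives is the version of the page served without JavaScript. You therefore need to target the class BNeawe s3v9rd AP7Wnd rather than the classes you see in the browser's inspector.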
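One way to confirm which markup you actually received is to parse the HTML and check each candidate selector against it. The following is only a minimal sketch: the helper name report_selectors is hypothetical, and sample_html is a shortened stand-in for the no-JavaScript response rather than real Google output. In practice you would pass in the text of your own requests response.

```python
from bs4 import BeautifulSoup


def report_selectors(html, selectors):
    """Map each CSS selector to True if it matches at least one element in html."""
    soup = BeautifulSoup(html, "html.parser")
    return {selector: soup.select_one(selector) is not None for selector in selectors}


# Assumption: a shortened stand-in for the HTML served without JavaScript.
sample_html = '<div class="BNeawe s3v9rd AP7Wnd">Madrid, Spain\'s central capital ...</div>'

# The first selector is the class seen in the browser, the second the no-JS class.
print(report_selectors(sample_html, ["div.kno-rdesc", "div.BNeawe.s3v9rd.AP7Wnd"]))
```

If the selector copied from the browser reports False while the other reports True, the None results in the question are explained by the two versions of the page using different markup.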

Several elements on the page share this class, but since find() returns only the first match, you can use it here:

import requests
from bs4 import BeautifulSoup


url_PresMadrid = "https://www.google.com/search?q=madrid"
req_PresPadrid = requests.get(url_PresMadrid)
soup_PresMadrid = BeautifulSoup(req_PresPadrid.content, "html.parser")
target_div_PresMadrid = soup_PresMadrid.find("div", {"class": "BNeawe s3v9rd AP7Wnd"})
print(target_div_PresMadrid.text)

Output:

Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro. It’s renowned for its rich repositories of European art, including the Prado Museum’s works by Goya, Velázquez and other Spanish masters. The heart of old Hapsburg Madrid is the portico-lined Plaza Mayor, and nearby is the baroque Royal Palace and Armory, displaying historic weaponry.
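Because these class names are generated by Google and may change, calling .text directly on the result of find() will raise an AttributeError whenever nothing matches. A slightly more defensive variant is sketched below. The function name extract_description and the sample HTML are assumptions made for illustration, and select_one() with a CSS selector is used here instead of find() so that the order of the class names in the attribute does not matter.

```python
from bs4 import BeautifulSoup


def extract_description(html, selector="div.BNeawe.s3v9rd.AP7Wnd"):
    """Return the text of the first element matching selector, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)  # first match in document order, like find()
    if node is None:
        return None  # class names changed or the page has no description box
    return node.get_text(" ", strip=True)


# Assumption: static stand-in HTML; in practice pass the content of your response.
sample_html = """
<div class="BNeawe s3v9rd AP7Wnd">
 Madrid, Spain's central capital, is a city of elegant boulevards.
</div>
<div class="BNeawe s3v9rd AP7Wnd">A later element with the same class.</div>
"""

description = extract_description(sample_html)
if description is None:
    print("Description not found; inspect the returned HTML for the current class names.")
else:
    print(description)
```

With a live response, the call would be extract_description(req_PresPadrid.content), reusing the variable from the answer's code.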
