How to extract the text under "About Us" on a web page using BeautifulSoup

Posted: 2019-08-03 18:13:18

Tags: html xml beautifulsoup

I am new to web scraping and I'm not sure how to extract the text under an "About Us" heading on a web page.

The class of the "About Us" heading differs from one web page to another.

Could you guide me, or provide code, to extract the text under "About Us" on web pages such as https://github.com/chocolatey/choco/issues/50?

I can see "About Us" among the headings, but I'm unable to extract the data under it with this:

for heading in soup.find_all(re.compile("^h[1-6]")):
    print(heading.name + ' ' + heading.text.strip())

Thanks, Naidu

2 answers:

Answer 0 (score: 0)

This script will select every heading tag (<h1> through <h6>) whose text is exactly "About Us" (case-insensitive):

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.thestylistgroup.com/'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

for tag in soup.find_all(lambda t: re.findall(r'h\d+', t.name) and t.text.strip().lower()=='about us'):
    print(tag)
    print(tag.next_sibling.text) # This will get text from the next sibling tag

Output:

<h2 class="css-6r2li">About Us</h2>
The Stylist Group is a leading digital publisher and media platform with pioneering brands Stylist and Emerald Street. Within an inspiring, fast-paced, entrepreneurial environment we create original magazines and digital brands for Stylist Women - our successful, sophisticated, dynamic and urban audience. These people have very little time, a considerable disposable income and no patience with inauthentic attempts to try to engage them. Our purpose is to create content Stylist Women are proud to enjoy.
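
Note that `.next_sibling` returns whatever node immediately follows the heading, which on some pages is a bare whitespace string rather than a tag. A slightly more defensive variant (just a sketch, reusing the same example URL) calls `find_next_sibling()`, which skips text nodes and returns the next tag element:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.thestylistgroup.com/'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

for tag in soup.find_all(lambda t: re.findall(r'h\d+', t.name) and t.text.strip().lower() == 'about us'):
    sibling = tag.find_next_sibling()  # skips whitespace/text nodes, returns the next tag element
    if sibling is not None:
        print(sibling.get_text(strip=True))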

Answer 1 (score: 0)

Assuming the text of interest is always the immediate sibling of the heading, you could use the following (bs4 4.7.1+). Note that the immediate-sibling assumption can produce incorrect results on some pages.

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.thestylistgroup.com/')
soup = bs(r.content, 'lxml')
for h in range(1,7):
    header_with_sibling = soup.select('h' + str(h) + ':contains("About Us") + *')
    if header_with_sibling:
        for i in header_with_sibling:
            print(i.text)
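
As a side note, newer versions of the soupsieve backend that BeautifulSoup uses for CSS selectors deprecate the `:contains()` pseudo-class in favour of `:-soup-contains()`. If you see a deprecation warning, the `select` call inside the loop can be written as:

    header_with_sibling = soup.select('h' + str(h) + ':-soup-contains("About Us") + *')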

If you want to stop at the first match:

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.thestylistgroup.com/')
soup = bs(r.content, 'lxml')
for h in range(1,7):
    header_with_sibling = soup.select_one('h' + str(h) + ':contains("About Us") + *')
    if header_with_sibling:
        print(header_with_sibling.text)
        break
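
The loop over heading levels can also be collapsed into a single selector with `:is()`, which recent soupsieve versions support alongside `:-soup-contains()`. A minimal sketch, again assuming the text you want sits in the tag immediately following the heading:

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.thestylistgroup.com/')
soup = bs(r.content, 'lxml')

# first heading (any level) whose text contains "About Us", then its next sibling tag
about = soup.select_one(':is(h1, h2, h3, h4, h5, h6):-soup-contains("About Us") + *')
if about:
    print(about.text)

Unlike the loop above, this returns the first match in document order rather than preferring lower heading levels.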