Question

Given the HTML below, I want to output the text of the h1 without the "Details about" part, which is the text of the span nested inside the h1.

My current output is:

Details about   New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

What I want is:

New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

Here is the HTML I am working with:

<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>

This is my current code:

for line in soup.find_all('h1',attrs={'itemprop':'name'}):
    print(line.get_text())

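Note: I don't want to simply truncate the string, because I would like this code to be reusable. Ideally I'd like some code that strips out any text enclosed in a span.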

Answer 1

You can use extract() to remove all the span tags:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # Detach every <span> inside the <h1> before reading its text
    for s in line('span'):
        s.extract()
    print(line.get_text())
# => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
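Note that extract() modifies the parsed tree, so the span is gone from soup afterwards. If that matters, here is a minimal sketch of a reusable variant that works on a copy instead. The helper name text_without is my own invention, not part of BeautifulSoup, and it assumes a bs4 version where copy.copy() on a Tag is supported:

```python
import copy
from bs4 import BeautifulSoup

html = ('<h1 class="it-ttl" itemprop="name" id="itemTitle">'
        '<span class="g-hdn">Details about  &nbsp;</span>'
        "New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>")
soup = BeautifulSoup(html, "html.parser")


def text_without(tag, name):
    """Hypothetical helper: text of `tag` with every descendant `name` element dropped."""
    clone = copy.copy(tag)  # copy the tag and its children; the original tree is untouched
    for child in clone.find_all(name):
        child.decompose()
    return clone.get_text().strip()


h1 = soup.find("h1", attrs={"itemprop": "name"})
title = text_without(h1, "span")
print(title)
# => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
```

The span is still present in soup after the call, so the same tree can be queried again later.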

Answer 2

One solution is to check whether each piece of content contains HTML:

from bs4 import BeautifulSoup

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        if bool(BeautifulSoup(str(content), "html.parser").find()):
            continue

        print(content)

Another solution (which I prefer) is to check whether each piece of content is an instance of bs4.element.Tag:

import bs4

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = bs4.BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        if isinstance(content, bs4.element.Tag):
            continue

        print(content)
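Since the question asks for something reusable, the isinstance idea can be wrapped in a small function. This is only a sketch: own_text is a name I made up, and it checks for NavigableString (skipping comments) rather than excluding Tag, which amounts to the same filter for this HTML:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString


def own_text(tag):
    """Hypothetical helper: join only the text nodes that are direct children of `tag`."""
    parts = [
        child for child in tag.children
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ]
    return "".join(parts).strip()


html = ('<h1 class="it-ttl" itemprop="name" id="itemTitle">'
        '<span class="g-hdn">Details about  &nbsp;</span>'
        "New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>")
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1", attrs={"itemprop": "name"})
title = own_text(h1)
print(title)
# => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
```

Unlike removing only span tags, this drops the text of every nested element, so a title such as `<h1>New <b>wallet</b></h1>` would lose the word inside the b tag. Whether that is acceptable depends on the pages being parsed.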

BeautifulSoup .get_text() is not specific enough for my HTML parsing
