BeautifulSoup .get_text() is not specific enough for my HTML parsing

Time: 2015-07-16 18:57:06

Tags: python html regex beautifulsoup

Given the HTML below, I want to output the text of the h1 without "Details about", which is the text of the span nested inside the h1.

My current output is:

Details about   New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

What I want:

New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

Here is the HTML I am working with:

<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>

Here is my current code:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    print(line.get_text())

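Note: I don't want to simply truncate the string, because I'd like this code to be reusable. Ideally I'd have some code that strips out any text enclosed in a span.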

2 Answers:

Answer 0 (score: 5)

You can use extract() to remove all the span tags:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # Remove every <span> (and its text) from the heading, in place
    for span in line('span'):
        span.extract()
    print(line.get_text())
    # => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
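
Note that extract() modifies the parsed tree, so the span is gone from soup afterwards. If that matters, one option is to strip the spans from a copy of the tag instead. This is only a sketch: text_without is a hypothetical helper name, and it assumes a Beautiful Soup version in which copy.copy() works on a Tag (4.4.0 or later, as far as I know).

```python
import copy

from bs4 import BeautifulSoup

html = ('<h1 class="it-ttl" itemprop="name" id="itemTitle">'
        '<span class="g-hdn">Details about  &nbsp;</span>'
        'New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>')
soup = BeautifulSoup(html, 'html.parser')


def text_without(tag, name='span'):
    """Return the text of `tag` with all descendant `name` tags removed.

    Hypothetical helper: it works on a copy, so the original tree is untouched.
    """
    clone = copy.copy(tag)              # detached copy of the tag and its children
    for child in clone.find_all(name):  # find_all returns a list, safe to mutate while looping
        child.decompose()               # drop the tag together with its text
    return clone.get_text().strip()


h1 = soup.find('h1', attrs={'itemprop': 'name'})
title = text_without(h1)
print(title)
```

The original h1 still contains its span after the call, so the same soup can be reused for other lookups.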

Answer 1 (score: 0)

One solution is to check whether each child of the h1 contains HTML:

from bs4 import BeautifulSoup

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Re-parse the child; if any tag is found, it is markup, so skip it
        if BeautifulSoup(str(content), "html.parser").find() is not None:
            continue

        print(content)

Another solution (which I prefer) is to check whether each child is an instance of bs4.element.Tag:

import bs4

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = bs4.BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Skip child tags such as the <span>; only plain strings are printed
        if isinstance(content, bs4.element.Tag):
            continue

        print(content)
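
Since the question asks for something reusable, the isinstance check can be wrapped in a small function. A minimal sketch, where direct_text is a name made up for illustration: it keeps only the strings that are direct children of the tag, so text inside any nested tag (not just span) is dropped.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = ('<h1 class="it-ttl" itemprop="name" id="itemTitle">'
        '<span class="g-hdn">Details about  &nbsp;</span>'
        'New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>')
soup = BeautifulSoup(html, 'html.parser')


def direct_text(tag):
    """Join the strings that are direct children of `tag` (hypothetical helper).

    Child tags are skipped entirely. Comments are NavigableString subclasses,
    so they would be included; filter them out separately if that matters.
    """
    parts = [str(child) for child in tag.children
             if isinstance(child, NavigableString)]
    return ''.join(parts).strip()


titles = [direct_text(h1)
          for h1 in soup.find_all('h1', attrs={'itemprop': 'name'})]
print(titles[0])
```

Unlike extract(), this leaves the parsed tree unchanged, at the cost of also ignoring text in nested tags that you might have wanted to keep (for example an em inside the title).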