Getting text that has no tag with BeautifulSoup

Date: 2016-07-17 16:42:51

Tags: python beautifulsoup

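I am trying to use BeautifulSoup to get some text that is not wrapped in a tag of its own. I tried .string, .contents, .text, .find(text=True) and .next_sibling, all of which are listed below.

Edit: Never mind, I just noticed that .next_sibling works for me. Either way, this question can serve as a collection of notes on how to handle similar cases.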
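For reference, the .next_sibling approach mentioned in the edit can be sketched as follows. This is a minimal example, assuming the wanted text directly follows the first <a> tag; the shortened markup and the variable names html, first_link and wanted are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative markup with the same shape as the question's input.
html = "<p><a>skip me</a> I want to fetch this line. <a>skip me too</a></p>"
soup = BeautifulSoup(html, "html.parser")

# .next_sibling has to be called on the first <a> tag, not on the soup object:
# the soup object itself has no siblings.
first_link = soup.find("a")

# The sibling is a NavigableString (a str subclass), so .strip() trims the
# surrounding whitespace.
wanted = first_link.next_sibling.strip()
print(wanted)
```

This only works when the position of the text relative to a known tag is fixed.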

from bs4 import BeautifulSoup
s = """
<p>
    <a>
        Something I can fetch but don't want
    </a> 
    I want to fetch this line.
    <a>
        Something else I can fetch but don't want
    </a>
</p>
"""

p = BeautifulSoup(s, 'html.parser')
print(p.contents)
    # ['\n', <p>
    # <a>
    #     Something
    # </a>
    #     I want to fetch this line.
    # <a>
    #     Something else
    # </a>
    # </p>, '\n']  (output abbreviated)

# .next_sibling must be called on the first <a> tag; the soup object itself has no sibling.
print(p.a.next_sibling.string)
    # I want to fetch this line. (with surrounding whitespace)
print(p.string)
    # None
print(p.text)
    # all the texts, including those I can get but don't want.
print(p.find(text=True))
    # Returns an empty line of type bs4.element.NavigableString
print(p.find(text=True)[0])
    # Returns an empty line of type str (unicode on Python 2)

Is there a simpler way to get the line I want than parsing the string s by hand?

1 answer:

Answer 0: (score: 2)

Try this. It is still rough, but at least it does not require you to parse the string by hand.

# Collect every non-empty string in the document, stripped of whitespace.
texts = [x.strip() for x in p.strings if x.strip()]

# Collect the stripped text of every tag; this is the text we do not want.
unwanted_text = [x.text.strip() for x in p.find_all()]

# Take the difference: the strings that are not the full text of any tag.
set(texts).difference(unwanted_text)

This yields:

In [87]: set(texts).difference(unwanted_text)
Out[87]: {'I want to fetch this line.'}
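
One caveat with the set difference above: a set loses order and duplicates, and any direct text that happens to equal the text of some tag is dropped as well. A possible alternative, sketched here under the assumption that the wanted text is a direct child of the <p> tag, is to keep only the string nodes among that tag's children. The helper name direct_text is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString

html = """
<p>
    <a>
        Something I can fetch but don't want
    </a>
    I want to fetch this line.
    <a>
        Something else I can fetch but don't want
    </a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")
paragraph = soup.find("p")


def direct_text(tag):
    """Return the non-empty string nodes that are direct children of tag.

    Hypothetical helper for illustration. Text nested inside child tags
    (the <a> elements here) is skipped because only tag.children is
    inspected, not the whole subtree.
    """
    return [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString) and child.strip()
    ]


print(direct_text(paragraph))
```

Because the result is a list, document order and repeated lines are preserved. HTML comments are also NavigableString instances, so they would need an extra check if the markup contains any.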