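I'm trying to use BeautifulSoup to get some text that isn't wrapped in a tag of its own. I tried .string, .contents, .text, .find(text=True), and .next_sibling; all of them are listed below.
Edit: Never mind, I just noticed that .next_sibling works for me. Either way, this question can serve as a collection of notes on how to handle similar cases.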
from bs4 import BeautifulSoup
s = """
<p>
<a>
Something I can fetch but don't want
</a>
I want to fetch this line.
<a>
Something else I can fetch but don't want
</a>
</p>
"""
p = BeautifulSoup(s, 'html.parser')
print(p.contents)
# ['\n', <p>
# <a>
# Something
# </a>
# I want to fetch this line.
# <a>
# Something else
# </a>
# </p>, '\n']
# The wanted text is the node right after the first <a>, so navigate from that tag.
print(p.a.next_sibling.strip())
# I want to fetch this line.
print(p.string)
# None
print(p.text)
# All the text, including the parts I can get but don't want.
print(p.find(text=True))
# Returns an empty line of type bs4.element.NavigableString
print(p.find(text=True)[0])
# Returns an empty line of type str (unicode on Python 2)
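Is there a simpler way to get the line I want than parsing the string s by hand?
Answer 0 (score: 2)
Try this. It is still rough, but at least you don't have to parse the string by hand.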
# Collect every non-empty string in the document.
texts = [x.strip() for x in p.strings if x.strip()]
# Collect the full text of every tag (this is the text we don't want).
unwanted_text = [x.text.strip() for x in p.find_all()]
# Keep only the strings that are not the complete text of some tag.
set(texts).difference(unwanted_text)
This yields:
In [87]: set(texts).difference(unwanted_text)
Out[87]: {'I want to fetch this line.'}
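Note that the set difference discards order and duplicates, and it would also drop a bare string that happens to equal the full text of some tag. As a possible alternative, here is a minimal sketch that walks only the direct children of the <p> tag and keeps the ones that are text nodes. It assumes the same markup as in the question; the helper name direct_text and the variable names html and soup are made up for illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Same markup as in the question.
html = """
<p>
<a>
Something I can fetch but don't want
</a>
I want to fetch this line.
<a>
Something else I can fetch but don't want
</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")


def direct_text(tag):
    """Hypothetical helper: non-empty text nodes that are direct children of `tag`."""
    return [
        child.strip()
        for child in tag.children
        # Text nodes are NavigableString instances; nested tags such as <a> are skipped.
        if isinstance(child, NavigableString) and child.strip()
    ]


print(direct_text(soup.p))
# ['I want to fetch this line.']
```

Because it never descends into the nested <a> tags, this keeps the strings in document order and does not depend on comparing text values.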