Question

I have the following HTML block:

html = '''
<div class="details">Mol Cell Biol. 
  <span class="citation-publication-date">2001 Dec; </span>
  21(24): 8471–8482.
  <span class="doi">doi: 10.1128/MCB.21.24.8471-8482.2001</span>
</div>
'''

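Using only BeautifulSoup, I want to get the text that lies outside the <span> tags, i.e. Mol Cell Biol. and 21(24): 8471–8482. If I use s.text, I get all of the text, including the text inside the <span> tags: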

from bs4 import BeautifulSoup

s = BeautifulSoup(html, 'html.parser')
s.text

It would be even better if, in this particular case, I could grab Mol Cell Biol. and 21(24): 8471–8482. separately, i.e. get back the list ["Mol Cell Biol.", "21(24): 8471–8482."].

Answer 1

You can iterate over the text nodes and keep only those whose parent is not a span tag:

[text for text in s.find_all(text=True) if text.parent.name != "span"]

Output:

[u'Mol Cell Biol. ', u'21(24): 8471\u20138482.']
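With the exact HTML from the question, the text nodes also carry the newlines and indentation around the tags, so some cleanup is usually needed. The following is a minimal Python 3 sketch of the same parent-name filter, assuming a BeautifulSoup version recent enough to accept string=True (the newer spelling of text=True); the variable name outside_span is only illustrative.

```python
from bs4 import BeautifulSoup

html = '''
<div class="details">Mol Cell Biol.
  <span class="citation-publication-date">2001 Dec; </span>
  21(24): 8471–8482.
  <span class="doi">doi: 10.1128/MCB.21.24.8471-8482.2001</span>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Walk every text node, skip those whose direct parent is a <span>,
# strip surrounding whitespace, and drop nodes that were whitespace only.
outside_span = [
    node.strip()
    for node in soup.find_all(string=True)
    if node.parent.name != "span" and node.strip()
]
print(outside_span)
```

Note that this filter only looks at the direct parent, so text inside any other tag would still be kept.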

Answer 2

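There is a simpler way: use find_all(text=True) together with the recursive=False flag, which returns only the top-level text nodes, i.e. the direct children of the element: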

details = s.select_one(".details")
data = details.find_all(text=True, recursive=False)

Demo (with some post-processing):

>>> from bs4 import BeautifulSoup
>>> 
>>> html = '''
... <div class="details">Mol Cell Biol.
...   <span class="citation-publication-date">2001 Dec; </span>
...   21(24): 8471–8482.
...   <span class="doi">doi: 10.1128/MCB.21.24.8471-8482.2001</span>
... </div>
... '''
>>> 
>>> soup = BeautifulSoup(html, "html.parser")
>>> 
>>> details = soup.select_one(".details")
>>> data = details.find_all(text=True, recursive=False)
>>> data = [item.strip() for item in data]
>>> data = [item for item in data if item]
>>> print(data)
[u'Mol Cell Biol.', u'21(24): 8471\u20138482.']
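The two answers can give different results once the element contains tags other than span. The sketch below compares them on a small made-up snippet (the markup and the helper names direct_text and text_outside_span are hypothetical, not from the question), again assuming Python 3 and a BeautifulSoup version that accepts string=True.

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Stripped, non-empty strings that are direct children of tag."""
    pieces = (s.strip() for s in tag.find_all(string=True, recursive=False))
    return [p for p in pieces if p]


def text_outside_span(tag):
    """Stripped, non-empty strings under tag whose direct parent is not a <span>."""
    return [
        s.strip()
        for s in tag.find_all(string=True)
        if s.parent.name != "span" and s.strip()
    ]


# Hypothetical markup: one <em> and one <span> inside the same <div>.
sample = '<div class="details">A <em>B</em> C <span>D</span></div>'
details = BeautifulSoup(sample, "html.parser").select_one(".details")

top_level = direct_text(details)          # ignores text inside any child tag
non_span = text_outside_span(details)     # ignores only text inside <span>
print(top_level)
print(non_span)
```

Here the recursive=False version drops the text inside <em> as well, while the parent-name filter keeps it, so which one fits depends on whether nested non-span text should count as "outside".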

BeautifulSoup: getting the text outside of given elements
