BeautifulSoup: extracting text that is not inside a tag

Date: 2018-06-11 14:14:53

Tags: python beautifulsoup python-requests

I have parsed HTML like the following, and I am trying to extract the text in the same order in which it appears.

<b>
 <i>
  Data
 </i>
 Data Summary
</b>
<br/>
Data Description
<br/>
<br/>
<pre>Data paragraph which contains huge string<br/></pre>
<br/>
<br/>
<pre></pre>
<br/>
<br/>
<b>
 <i>
  Data 2
 </i>
 Data 2 Summary
</b>
<br/>
Data 2 Description
<br/>
<br/>
<pre>Data 2 paragraph which contains huge string<br/></pre>
<br/>
<br/>

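I can extract the text inside the b and i tags with bsoup.findAll(['b', 'i']), but I am having trouble getting the untagged text that follows each b tag. I tried next_sibling, but could not get it to work. Any help would be appreciated.

The expected result is: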

Data Summary : Data Description : Data paragraph which contains huge string newline Data 2 : Data 2 Summary : Data 2 Description : Data 2 paragraph which contains huge string

1 answer:

Answer 0 (score: 1)

You can iterate over every text node in the document, like this:

from bs4 import BeautifulSoup

html = """
<b><i>Data</i>Data Summary</b><br/>
Data Description<br/>
<br/>
<pre>Data paragraph which contains huge string<br/></pre>
<br/>
<br/>
<pre></pre>
<br/>
<br/>

<b><i>Data 2</i>Data 2 Summary</b><br/>
Data 2 Description<br/>
<br/>
<pre>Data 2 paragraph which contains huge string<br/></pre>
<br/>
<br/>"""

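soup = BeautifulSoup(html, "html.parser")
# string=True matches every text node; it is the current name for the older text=True argument.
# Strip each node and keep only those that are non-empty after stripping.
text_items = [t.strip() for t in soup.find_all(string=True) if t.strip()]
print(text_items)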

This also strips the surrounding whitespace and keeps only the items that are non-empty after stripping. It prints the following list:

['Data', 'Data Summary', 'Data Description', 'Data paragraph which contains huge string', 'Data 2', 'Data 2 Summary', 'Data 2 Description', 'Data 2 paragraph which contains huge string']
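
The flat list above does not separate one record from the next. If you want something closer to the colon-separated, one-line-per-record output in the question, one possible sketch is to start a record at each b tag and walk its next_siblings until the following b tag. This assumes that every record begins with a top-level b tag, as in the sample markup; the names records and parts are only illustrative.

```python
from bs4 import BeautifulSoup, Tag

html = """
<b><i>Data</i>Data Summary</b><br/>
Data Description<br/>
<br/>
<pre>Data paragraph which contains huge string<br/></pre>
<br/>
<pre></pre>
<br/>

<b><i>Data 2</i>Data 2 Summary</b><br/>
Data 2 Description<br/>
<br/>
<pre>Data 2 paragraph which contains huge string<br/></pre>
<br/>"""

soup = BeautifulSoup(html, "html.parser")

records = []
for b in soup.find_all("b"):
    # Text inside the <b> tag itself, including the nested <i>.
    parts = list(b.stripped_strings)
    # Walk the nodes that follow this <b> until the next <b> starts.
    for sibling in b.next_siblings:
        if isinstance(sibling, Tag):
            if sibling.name == "b":
                break
            text = sibling.get_text(strip=True)  # e.g. the <pre> contents
        else:
            text = sibling.strip()  # bare text node between tags
        if text:
            parts.append(text)
    records.append(" : ".join(parts))

print("\n".join(records))
```

A single next_sibling call right after a b tag returns the br element (or a whitespace-only text node, depending on the markup), which may be why it appeared not to work; iterating over next_siblings and skipping empty results avoids that.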