How to extract text between tags using BeautifulSoup in Python

Date: 2017-02-05 11:31:16

Tags: python beautifulsoup

I am trying to extract text from the following HTML structure:

<div class= "story-body story-content">
 <p>
  <br>
  "the text I want to get"
  <a href= "http://...>
  <br>
  "the text I want to get"
  <a href="http:// ... >
  .
  .

I have already extracted the hyperlinks, but I don't know how to extract the text as well. So far I have tried:

names = []
for div in soup3.find_all("div", attrs={"class" : "story-body story-content"}):
    for t in div.find_all('br'):
        t = t.get_text()
        names.append(t)

But I only get:

[<br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'']

2 answers:

Answer 0 (score: 2)

text_list = []
for div in soup3.find_all("div", attrs={"class": "story-body story-content"}):
    # Accumulate across all matching divs instead of overwriting on each pass
    text_list.extend(div.stripped_strings)

Use stripped_strings to get all the non-empty strings under the tag.

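The <br> tag only inserts a line break. It does not contain any text itself, which is why calling get_text() on it returns an empty string.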
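To make the point above concrete, here is a minimal sketch. The sample markup, the example.com URLs, and the link text are assumptions made up for illustration (the question's HTML is truncated), so treat this as a demonstration of the idea rather than the asker's real page. Note that stripped_strings also picks up the text inside the <a> tags.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed sample modelled on the question's markup.
sample = """
<div class="story-body story-content">
 <p>
  <br>
  first text
  <a href="http://example.com/a">link A</a>
  <br>
  second text
  <a href="http://example.com/b">link B</a>
 </p>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
div = soup.find("div", class_="story-body story-content")

# <br> is an empty element, so its own text is always the empty string.
br_texts = [br.get_text() for br in div.find_all("br")]

# stripped_strings walks every descendant string, including the link text.
all_strings = list(div.stripped_strings)

print(br_texts)     # ['', '']
print(all_strings)  # ['first text', 'link A', 'second text', 'link B']
```

If the link text is not wanted, it has to be filtered out separately, for example by looking only at the strings that directly follow each <br>.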

Answer 1 (score: 0)

from bs4 import BeautifulSoup

html = """
<div class= "story-body story-content">
<p>
<br>
"the text I want to get"
<a href= "http://...>
<br>
"the text I want to get"
<a href="http:// ... >
"""
s = BeautifulSoup(html, 'html.parser')
# next_sibling is the current spelling of the older nextSibling alias
s.br.next_sibling

This returns the text node that follows the first <br>:

'\n  "the text I want to get"\n  '

Or, to strip the surrounding whitespace:

s.br.next_sibling.strip()
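The snippet above only looks at the first <br>. A possible way to extend it to every <br> in the document is sketched below. The sample markup and URLs are again invented for illustration, and the isinstance check is an added safeguard (not part of the original answer) for the case where a <br> is followed directly by another tag or by nothing at all.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, well-formed sample modelled on the question's markup.
sample = """
<div class="story-body story-content">
 <p>
  <br>
  first text
  <a href="http://example.com/a">link A</a>
  <br>
  second text
  <a href="http://example.com/b">link B</a>
 </p>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

texts = []
for br in soup.find_all("br"):
    sibling = br.next_sibling
    # Keep only plain text nodes; skip tags, None, and whitespace-only strings.
    if isinstance(sibling, NavigableString) and sibling.strip():
        texts.append(sibling.strip())

print(texts)  # ['first text', 'second text']
```

Unlike stripped_strings, this collects only the text immediately after each <br>, so the link text is left out.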