Parsing unstructured HTML with BeautifulSoup

Time: 2015-06-27 11:40:16

Tags: python beautifulsoup html-parsing

I'm trying to parse this HTML with BeautifulSoup:

<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br />
  <strong>Wednesday</strong> Some info here...<br />
  ...
</div>

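I'd like to get the data for Tuesday only: <strong>Tuesday</strong> Some info here...<br /> But since there is no wrapping div around each day, I'm having trouble getting this data. Any suggestions?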

1 answer:

Answer 0 (score: 3)

How about this:

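from bs4 import BeautifulSoup

html = """<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br />
  <strong>Wednesday</strong> Some info here...<br />
  ...
</div>"""
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, 'html.parser')
# Tuesday's text is the first text node that follows its <strong> tag.
result = soup.find('strong', string='Tuesday').find_next_sibling(string=True)
# The text node is already a str in Python 3, so it can be printed directly.
print(result)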

Output

 Some info here...
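
To see why looking at siblings works here, it may help to list the direct children of the container. The sketch below is only illustrative: `describe_children` is a hypothetical helper name, not part of BeautifulSoup, and it assumes the built-in `html.parser`. It shows that the day names, the text, and the `<br />` tags all sit side by side at the same level, which is why the text after a `<strong>` tag is reachable as its next sibling.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br />
</div>"""


def describe_children(markup):
    """Return (kind, value) pairs for the direct children of the container div.

    Hypothetical helper for illustration only.
    """
    container = BeautifulSoup(markup, "html.parser").find("div", class_="container")
    described = []
    for child in container.children:
        if isinstance(child, Tag):
            # Tags such as <strong> and <br /> are reported by name.
            described.append(("tag", child.name))
        elif isinstance(child, NavigableString):
            # Text nodes, including whitespace-only ones, are reported verbatim.
            described.append(("text", str(child)))
    return described


for kind, value in describe_children(html):
    print(kind, repr(value))
```

Whitespace-only text nodes between the tags show up in this listing too, so code that walks siblings usually needs to skip or strip them.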

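Update based on the comments:

Basically, you can keep taking the next sibling text node after <strong>Tuesday</strong> until the element that follows that text is another <strong> element or None.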

from bs4 import BeautifulSoup

html = """<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br /> and then some <br />
  <strong>Wednesday</strong> Some info here...<br />
  ...
</div>"""
soup = BeautifulSoup(html, 'html.parser')
# Start at the first text node after <strong>Tuesday</strong>.
result = soup.find('strong', string='Tuesday').find_next_sibling(string=True)
# The tag after the current text node: a <br />, the next <strong>, or None.
next_sibling = result.find_next_sibling()
while next_sibling is not None and next_sibling.name != 'strong':
    print(result)
    result = next_sibling.find_next_sibling(string=True)
    if result is None:  # no text left in the container
        break
    next_sibling = result.find_next_sibling()

Output

 Some info here...
 and then some
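
The same idea can be wrapped in a reusable function. This is a minimal sketch rather than a definitive implementation: `texts_for_day` is a hypothetical name introduced here, and it iterates over `next_siblings` instead of calling `find_next_sibling` repeatedly. It also assumes you want whitespace stripped and an empty list when the day is missing, neither of which the original question specifies.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br /> and then some <br />
  <strong>Wednesday</strong> Some info here...<br />
  ...
</div>"""


def texts_for_day(markup, day):
    """Collect the text between <strong>day</strong> and the next <strong>.

    Hypothetical helper; returns an empty list if the day is not found.
    """
    soup = BeautifulSoup(markup, "html.parser")
    heading = soup.find("strong", string=day)
    if heading is None:  # find() returns None when nothing matches
        return []
    texts = []
    for node in heading.next_siblings:
        if isinstance(node, Tag):
            if node.name == "strong":  # the next day starts here
                break
            continue  # skip <br /> and any other tags
        if isinstance(node, NavigableString):
            stripped = node.strip()
            if stripped:  # ignore whitespace-only text nodes
                texts.append(stripped)
    return texts


print(texts_for_day(html, "Tuesday"))  # ['Some info here...', 'and then some']
```

Checking for `None` up front avoids the `AttributeError` that a chained call raises when the requested day does not appear in the markup.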