How can I parse a text structure with BeautifulSoup?

Time: 2018-02-19 20:53:19

Tags: python html python-3.x beautifulsoup

I want to parse some information in Python with BeautifulSoup. This is the source HTML:

<dt class="time">21:07</dt>
<dd class="mix">
  <ul class="item">
    <li class="title">
      <span>John</span>
      <span>Room 1</span>
    </li>
  </ul>
</dd>
<dt class="time">21:10</dt>
<dd class="mix">
  <ul class="item">
    <li class="title">
      <span>Susi</span>
      <span>Room 2</span>
    </li>
  </ul>
</dd>
....

I want to output everything in the following form:

21:07 John Room 1
21:10 Susi Room 2

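What I have tried so far:

from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen(html_page)  # html_page: the page URL (not shown in the question)
soup = BeautifulSoup(page, 'lxml')
soup.prettify()  # return value is discarded, so this line has no effect
# All time entries and all room entries, collected as two separate lists
times = soup.find_all('dt', {'class': 'time'})
roominfos = soup.find_all('li', {'class': 'title'})
for time in times:
    print(time.text)
for roominfo in roominfos:
    print(roominfo.text)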

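I can only get the time entries and the room information separately, not side by side. How can I print them together on one line?

Thanks.

1 Answer:

Answer 0 (score: 0)

You can try this: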

from bs4 import BeautifulSoup as soup
import re
s = """
<dt class="time">21:07</dt>
<dd class="mix">
<ul class="item">
<li class="title">
<span>John</span>
<span>Room 1</span>
</li>
</ul>
</dd>
<dt class="time">21:10</dt>
<dd class="mix">
<ul class="item">
<li class="title">
<span>Susi</span>
<span>Room 2</span>
</li>
</ul>
</dd>
"""
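# Parse the snippet; the 'lxml' parser requires the lxml package to be installed
data = soup(s, 'lxml')
# Collect the text of every <dt> and <span> tag in document order
final_data = [i.text for i in data.find_all(re.compile('^(dt|span)$'))]
# Group the flat list into [time, name, room] triples (assumes exactly two spans per time)
new_final_data = [final_data[i:i+3] for i in range(0, len(final_data), 3)]
print(new_final_data)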

Output:

[['21:07', 'John', 'Room 1'], ['21:10', 'Susi', 'Room 2']]
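
The fixed chunk size of 3 only works while every time entry has exactly two spans. As a possible alternative, here is a minimal sketch that pairs each `dt` with the `dd` that follows it and prints the lines in the form the question asks for. The helper name `rows_from_html` is made up for illustration, and the built-in `html.parser` is used only so the example does not depend on lxml; the sketch assumes every `dt.time` is followed by its own `dd` sibling, as in the sample HTML.

```python
from bs4 import BeautifulSoup

html = """
<dt class="time">21:07</dt>
<dd class="mix">
  <ul class="item">
    <li class="title">
      <span>John</span>
      <span>Room 1</span>
    </li>
  </ul>
</dd>
<dt class="time">21:10</dt>
<dd class="mix">
  <ul class="item">
    <li class="title">
      <span>Susi</span>
      <span>Room 2</span>
    </li>
  </ul>
</dd>
"""


def rows_from_html(markup):
    """Return one 'time name room' string per <dt class="time"> entry.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for dt in soup.find_all("dt", class_="time"):
        # The details for a time live in the next <dd> sibling
        dd = dt.find_next_sibling("dd")
        if dd is None:
            continue  # no details found for this time; skip it
        # Take however many spans the entry has, not a fixed count
        spans = [span.get_text(strip=True) for span in dd.find_all("span")]
        rows.append(" ".join([dt.get_text(strip=True)] + spans))
    return rows


rows = rows_from_html(html)
for row in rows:
    print(row)
# prints:
# 21:07 John Room 1
# 21:10 Susi Room 2
```

Because each time is looked up together with its own sibling, an entry with one span or three spans would still land on the correct line, which the slice-by-3 approach cannot guarantee.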