I want to parse some information in Python using BeautifulSoup. Here is the source HTML:
<dt class="time">21:07</dt>
<dd class="mix">
<ul class="item">
<li class="title">
<span>John</span>
<span>Room 1</span>
</li>
</ul>
</dd>
<dt class="time">21:10</dt>
<dd class="mix">
<ul class="item">
<li class="title">
<span>Susi</span>
<span>Room 2</span>
</li>
</ul>
</dd>
....
I want to output everything in the following form:
21:07 John Room 1
21:10 Susi Room 2
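What I have tried so far:
# Imports assume Python 3 (on Python 2, urlopen lives in urllib2)
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen(html_page)  # html_page is defined elsewhere
soup = BeautifulSoup(page, 'lxml')
soup.prettify()
times = soup.find_all('dt', {'class': 'time'})
roominfos = soup.find_all('li', {'class': 'title'})
# Prints every time first...
for time in times:
    print(time.text)
# ...and only then every name/room entry
for roominfo in roominfos:
    print(roominfo.text)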
I can only get the time entries and the room info separately, not side by side. How can I do that?
Thanks.
Answer 0 (score: 0)
You can try this:
from bs4 import BeautifulSoup as soup
import re
s = """
<dt class="time">21:07</dt>
<dd class="mix">
<ul class="item">
<li class="title">
<span>John</span>
<span>Room 1</span>
</li>
</ul>
</dd>
<dt class="time">21:10</dt>
<dd class="mix">
<ul class="item">
<li class="title">
<span>Susi</span>
<span>Room 2</span>
</li>
</ul>
</dd>
"""
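data = soup(s, 'lxml')
# Collect the text of every <dt> and <span> tag in document order
final_data = [i.text for i in data.find_all(re.compile('dt|span'))]
# Group into chunks of 3 (time, name, room); relies on each <dt> being followed by exactly two <span> tags
new_final_data = [final_data[i:i+3] for i in range(0, len(final_data), 3)]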
Output:
[[u'21:07', u'John', u'Room 1'], [u'21:10', u'Susi', u'Room 2']]
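The chunking above depends on every entry having exactly two spans. As an alternative sketch (not part of the original answer), each `dt` can be paired with the `dd` that follows it, which also yields the one-line-per-entry format asked for in the question. The helper name `rows_from_html` is hypothetical, the built-in `html.parser` is used here instead of `lxml` only to avoid an extra dependency, and the code assumes each `dt` is immediately followed by its own `dd` at the same nesting level.

```python
from bs4 import BeautifulSoup

html = """
<dt class="time">21:07</dt>
<dd class="mix">
<ul class="item">
<li class="title">
<span>John</span>
<span>Room 1</span>
</li>
</ul>
</dd>
<dt class="time">21:10</dt>
<dd class="mix">
<ul class="item">
<li class="title">
<span>Susi</span>
<span>Room 2</span>
</li>
</ul>
</dd>
"""


def rows_from_html(markup):
    """Hypothetical helper: return one 'time name room' string per <dt>."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for dt in soup.find_all("dt", class_="time"):
        # Assumption: the <dd> holding the details is the next <dd> sibling
        dd = dt.find_next_sibling("dd")
        if dd is None:
            continue  # a time with no details block is skipped
        spans = [span.get_text(strip=True) for span in dd.find_all("span")]
        rows.append(" ".join([dt.get_text(strip=True)] + spans))
    return rows


rows = rows_from_html(html)
for row in rows:
    print(row)
```

Because the spans are looked up inside each `dd`, an entry with one or three spans would still stay on its own line rather than shifting the grouping of later entries.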