.....
<div class="day"><div class="content">Idag<span id='updatedby'>, by <b>Karl</b> (100) </span></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon- Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text1 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text2 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text3 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
<div class="day"><div class="content">2014-01-14<span id='updatedby'>, by<b>Person</b> (50)</span></div></div>
""""**DO NOT PRINT THIS**""""
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text4 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
""""**DO NOT PRINT THIS**""""
....
From that HTML I want to extract everything between the first div with class="day" and the next div with class="day".
The output should be:
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon- Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text1 </div></a><br /><div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text2 </div></a><br /><div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text3 </div></a><br /><div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
My current code is as follows:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('text.html'))
contain = []
contain = soup.find_all('div',{'class':'day'})
del contain[2::]
print (contain)
With this code, the output I get is:
[<div class="day"><div class="content">Idag<span id="updatedby">, by<b>Karl</b> (100)</span></div></div>, <div class="day"><div class="content">2014-01-14<span id="updatedby">, by <b>Person</b> (50)</span></div></div>]
Answer 0 (score: 1)
You can do it like this:
from bs4 import BeautifulSoup
data = '''
<div class="day"><div class="content">Idag<span id='updatedby'>, by <b>Karl</b> (100) </span></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon- Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text1 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text2 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text3 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
<div class="day"><div class="content">2014-01-14<span id='updatedby'>, by<b>Person</b> (50)</span></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text4 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
'''
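soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
result = []
tag = soup.find('div', {'class': 'day'})  # first div with class="day"
while True:
    tag = tag.next_sibling
    if tag is None:
        break  # no more siblings and no second "day" div was found
    # Text nodes are str subclasses and carry no attributes, so only test real tags.
    if not isinstance(tag, str) and 'day' in tag.get('class', []):
        break
    result.append(tag)
for e in result:
    print(e)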
Result:
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img alt="Hemsida" class="type" src="img/ikon- Hemsida.gif"/><div class="text"> Sample text1 </div></a><br/> <div class="sbar"><img alt="Kommentarer" class="comment" src="img/comment.gif"/> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img alt="Hemsida" class="type" src="img/ikon-Hemsida.gif"/><div class="text"> Sample text2 </div></a><br/> <div class="sbar"><img alt="Kommentarer" class="comment" src="img/comment.gif"/> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img alt="Hemsida" class="type" src="img/ikon-Hemsida.gif"/><div class="text"> Sample text3 </div></a><br/> <div class="sbar"><img alt="Kommentarer" class="comment" src="img/comment.gif"/> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
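This code assumes you are dealing with a flat run of sibling nodes (no nesting). It starts at the first class="day" div, then steps through the following siblings one by one, appending each to the result list, until it reaches the next class="day" div, at which point it breaks out of the loop.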
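The same sibling walk can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name `siblings_until_next` and the shortened `sample` markup are made up for illustration, the built-in `html.parser` is used, and whitespace text nodes between the divs are skipped rather than collected.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def siblings_until_next(html, cls="day"):
    """Return the tag siblings after the first div.<cls>, up to the next div.<cls>."""
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find("div", class_=cls)
    if start is None:
        return []  # no marker div at all
    collected = []
    for node in start.next_siblings:
        if not isinstance(node, Tag):
            continue  # skip whitespace and other text nodes between tags
        if cls in node.get("class", []):
            break  # reached the next marker div
        collected.append(node)
    return collected


# Shortened, hypothetical stand-in for the markup in the question.
sample = (
    '<div class="day">Idag</div>\n'
    '<div class="link">Sample text1</div>\n'
    '<div class="link">Sample text2</div>\n'
    '<div class="day">2014-01-14</div>\n'
    '<div class="link">Sample text4</div>\n'
)
for div in siblings_until_next(sample):
    print(div.get_text())
```

Iterating over `next_siblings` ends on its own when the siblings run out, so the helper also returns cleanly when there is no second class="day" div.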