Python 3.3: BeautifulSoup text between two classes

Date: 2014-01-15 11:38:56

Tags: python beautifulsoup

.....
<div class="day"><div class="content">Idag<span id='updatedby'>, by <b>Karl</b> (100)     </span></div></div>

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-   Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text1 </div></a><br />   <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a    href="?p=komment&id=xxxxx">18 comments</a></div></div>

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text2 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text3 </div></a><br />  <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a   href="?p=komment&id=xxxxx">18 comments</a></div></div>

<div class="day"><div class="content">2014-01-14<span id='updatedby'>, by<b>Person</b>  (50)</span></div></div>

""""**DO NOT PRINT THIS**""""
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text4 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div> 
""""**DO NOT PRINT THIS**""""
....

From that HTML I want to extract everything between the first div with class="day" and the next div with class="day".

The output should be:

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon- Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text1 </div></a><br /><div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text2 </div></a><br /><div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text3 </div></a><br /><div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>

My current code is:

from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('text.html'))
contain = []
contain = soup.find_all('div',{'class':'day'})
del contain[2::]
print (contain)

With this code, the output I get is:

[<div class="day"><div class="content">Idag<span id="updatedby">, by<b>Karl</b> (100)</span></div></div>, <div class="day"><div class="content">2014-01-14<span id="updatedby">, by <b>Person</b> (50)</span></div></div>]

1 answer:

Answer 0: (score: 1)

You can do it like this:

from bs4 import BeautifulSoup

data = '''
<div class="day"><div class="content">Idag<span id='updatedby'>, by <b>Karl</b> (100)     </span></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-   Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text1 </div></a><br />   <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a    href="?p=komment&id=xxxxx">18 comments</a></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text2 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text3 </div></a><br />  <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a   href="?p=komment&id=xxxxx">18 comments</a></div></div>
<div class="day"><div class="content">2014-01-14<span id='updatedby'>, by<b>Person</b>  (50)</span></div></div>
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text4 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div> 
'''
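# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, 'html.parser')

result = []
# Start at the first <div class="day"> and walk its following siblings.
tag = soup.find('div', {'class': 'day'})
for sibling in tag.next_siblings:
    # Text nodes have no tag name; a tag without a class attribute yields [].
    if getattr(sibling, 'name', None) and 'day' in sibling.get('class', []):
        break
    result.append(sibling)
for e in result:
    print(e)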

Result:

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img alt="Hemsida" class="type" src="img/ikon-   Hemsida.gif"/><div class="text"> Sample text1 </div></a><br/> <div class="sbar"><img alt="Kommentarer" class="comment" src="img/comment.gif"/> <a href="?p=komment&amp;id=xxxxx">18 comments</a></div></div>


<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img alt="Hemsida" class="type" src="img/ikon-Hemsida.gif"/><div class="text"> Sample text2 </div></a><br/> <div class="sbar"><img alt="Kommentarer" class="comment" src="img/comment.gif"/> <a href="?p=komment&amp;id=xxxxx">18 comments</a></div></div>


<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img alt="Hemsida" class="type" src="img/ikon-Hemsida.gif"/><div class="text"> Sample text3 </div></a><br/> <div class="sbar"><img alt="Kommentarer" class="comment" src="img/comment.gif"/> <a href="?p=komment&amp;id=xxxxx">18 comments</a></div></div>

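This code assumes you are dealing with a flat run of sibling nodes (no nesting). It starts at the first class="day" div, then steps through the following siblings and appends each one to the result list until it reaches the next class="day" div, at which point it breaks out of the loop. The whitespace between the divs is also a sibling (a text node), which is why blank lines appear in the printed result.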
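
If you want the same sibling walk as a reusable function, something like the following might work. It is only a sketch: the function name tags_between_day_divs and the SAMPLE markup are made up for illustration, it assumes the marker divs and the divs you want are direct siblings, and it drops the whitespace text nodes so that only tags are returned.

```python
from bs4 import BeautifulSoup, Tag


def tags_between_day_divs(html):
    """Collect the tags after the first div.day, up to the next div.day.

    Hypothetical helper: assumes all the divs involved are siblings.
    """
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find("div", class_="day")
    if start is None:
        # No marker div at all, so there is nothing to collect.
        return []

    collected = []
    for node in start.next_siblings:
        if not isinstance(node, Tag):
            # Skip text nodes such as the newlines between the divs.
            continue
        if "day" in node.get("class", []):
            # Reached the next marker div: stop collecting.
            break
        collected.append(node)
    return collected


# Made-up markup with the same shape as the page in the question.
SAMPLE = """
<div class="day">Idag</div>
<div class="link">Sample text1</div>
<div class="link">Sample text2</div>
<div class="day">2014-01-14</div>
<div class="link">Sample text4</div>
"""

for tag in tags_between_day_divs(SAMPLE):
    print(tag)
```

If there is no second class="day" div, the loop simply runs to the last sibling and returns everything after the first marker.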