How to parse items that have no class / id with BeautifulSoup

Time: 2014-05-13 07:33:53

Tags: python-2.7 beautifulsoup

I have HTML like this:

<div style="width:200px">
    <h2> My name1 </h2>
     DOB:17-6-1991
    <br>
    person details, person details,person details
    <div></div>
    <h2> My name2</h2>
     DOB:18-6-1991
    <br>
    person details, person details,person details
    <div></div>
    <h2> My name3 </h2>
     DOB:19-6-1991
    <br>
    person details, person details,person details
    <div></div>
    <h2> My name4 </h2>
     DOB:20-6-1991
    <br>
    person details, person details,person details
    <div></div>
    <h2> My name5 </h2>
     DOB:21-6-1991
    <br>
    person details, person details,person details
    <div></div>
</div>        

I am using Python and BeautifulSoup to parse this HTML. From the markup above I want to get output like this:

My name1
17-6-1991
person details, person details,person details

My name2
18-6-1991
person details, person details,person details
.
.
.
.
and so on

Please help me solve this problem.

1 answer:

Answer 0: (score: 1)

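There are many ways to solve your problem. I chose to iterate over the h2 elements in one loop and then walk over each header's following siblings in an inner loop. When I reach another h2, I break out of the inner loop. I did not strip the whitespace; you can do that with Python string methods such as strip, lstrip and rstrip. You can remove "DOB:" with str.replace.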

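from bs4 import BeautifulSoup, NavigableString

html = """your HTML here"""

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(html, "html.parser")

for h in soup.find_all("h2"):
    print(h.text)
    # Walk the nodes that follow this <h2> inside the same parent.
    for sibling in h.next_siblings:
        if isinstance(sibling, NavigableString):
            # Bare text node: the DOB line, the details, or whitespace.
            print(sibling)
        elif sibling.name == "h2":
            # The next person starts here.
            break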
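
The whitespace trimming and "DOB:" removal mentioned in the answer can be sketched roughly as below. This is only one possible arrangement, not the answerer's code: the helper name extract_people, the sample_html variable, and the choice to return (name, dob, details) tuples are all assumptions made for illustration. It also assumes the first non-empty text node after each h2 is the DOB line.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical two-person sample in the same shape as the question's HTML.
sample_html = """
<div style="width:200px">
    <h2> My name1 </h2>
     DOB:17-6-1991
    <br>
    person details, person details,person details
    <div></div>
    <h2> My name2</h2>
     DOB:18-6-1991
    <br>
    person details, person details,person details
    <div></div>
</div>
"""


def extract_people(html):
    """Return a list of (name, dob, details) tuples, one per <h2>.

    extract_people is an illustrative helper, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    people = []
    for h in soup.find_all("h2"):
        texts = []
        for sibling in h.next_siblings:
            if isinstance(sibling, NavigableString):
                text = sibling.strip()  # drop surrounding whitespace
                if text:  # skip whitespace-only nodes
                    texts.append(text)
            elif sibling.name == "h2":
                break  # the next person starts here
        # Assumption: the first text node is the "DOB:..." line.
        dob = texts[0].replace("DOB:", "").strip() if texts else ""
        details = " ".join(texts[1:])
        people.append((h.get_text(strip=True), dob, details))
    return people


for name, dob, details in extract_people(sample_html):
    print(name)
    print(dob)
    print(details)
    print("")
```

Collecting the text nodes per header before printing keeps the cleanup in one place, so a record with a missing DOB line yields an empty string instead of an error.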