Parsing HTML with BeautifulSoup when child and parent divs share the same class

Time: 2012-02-07 06:29:08

Tags: python regex beautifulsoup

I have this HTML, and I need to extract data from it:

<html>
<head></head>
<body>
<div class="main">
  <div class="utlimate"><p>hello</p></div>
  <div class = "headline"><p>some text</p></div>
   <div class="content">
     <div class = "utimate"> <p>TOP</p>
        <div class ="utlimate"> <p>data1</p></div>
        <div class ="utlimate"> <p>it could be anything</p></div>
        <div class ="utlimate"> <p>not</p></div>
        <div class ="utlimate"> <p></p></div>

     </div>
   </div>
</div>
</body>
</html>

I need to access the <div class="ultimate"> <p> elements whose values are "data1", "it could be anything", and "not". This is the code I tried:

soup = BeautifulSoup(HTML_data)     #HTML_data is all html content
first_div = soup.find('div',{"class" : "content"})
second_div = first_div.find('div',{"class" : "utlimate"})
div_list = second_div.findall('div',{"class" : "utlimate"})

The last line of my code fails with the error: 'NoneType' object is not callable

How can I access only those divs? Please help.

2 answers:

Answer 0: (score: 2)

Try this:

soup = BeautifulSoup(HTML_data)     #HTML_data is all html content
first_div = soup.find('div',{"class" : "content"})
second_div = first_div.find('div',{"class" : "utimate"})
div_list = second_div.findAll('div',{"class" : "utlimate"})

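The method that returns a list is findAll, not findall. Also, there is no "ultimate" in the HTML snippet; the classes are spelled "utlimate" or "utimate". Are those typos?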
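To make the point above concrete, here is a minimal sketch, assuming the bs4 package (where findAll is also available as find_all) and the built-in "html.parser" backend. The HTML is a trimmed copy of the question's snippet with attribute spacing normalized and the class-name typos kept as posted; the variable names outer, misspelled, inner_divs, texts, and non_empty are illustrative only. It shows that a misspelled method name is looked up as a child tag and yields None, which is why calling it raises 'NoneType' object is not callable.

```python
from bs4 import BeautifulSoup  # assumes the bs4 package is installed

# Trimmed copy of the question's HTML; class-name typos kept as posted.
HTML_data = """
<div class="main">
  <div class="utlimate"><p>hello</p></div>
  <div class="content">
    <div class="utimate"> <p>TOP</p>
      <div class="utlimate"> <p>data1</p></div>
      <div class="utlimate"> <p>it could be anything</p></div>
      <div class="utlimate"> <p>not</p></div>
      <div class="utlimate"> <p></p></div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(HTML_data, "html.parser")
outer = soup.find("div", {"class": "content"}).find("div", {"class": "utimate"})

# An unknown attribute name is treated as a child-tag lookup, so
# outer.findall means "first <findall> child", which does not exist.
misspelled = outer.findall

# find_all (spelled findAll in older releases) returns the matching descendants.
inner_divs = outer.find_all("div", {"class": "utlimate"})
texts = [d.p.get_text() for d in inner_divs]

# Drop the empty <p></p> entry so only the three wanted values remain.
non_empty = [t for t in texts if t]
print(misspelled, non_empty)
```

Because the outer div is selected by its "utimate" spelling and the inner ones by "utlimate", the top-level "hello" div is never picked up, even though it shares a class with the inner divs.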

Answer 1: (score: 1)

Is soup None?

I suggest you restructure your code to guard against that case:

soup = BeautifulSoup(HTML_data)     # HTML_data is all html content
if soup is None:
    pass  # handle the error here
else:
    c = soup.contents
    # use a regular expression (re) on the contents here
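The same defensive idea can be applied to each lookup, since find() returns None when nothing matches. The sketch below assumes the bs4 package and the "html.parser" backend; inner_texts, its parameters, and the sample string are hypothetical names introduced for illustration and do not come from the thread.

```python
from bs4 import BeautifulSoup  # assumes the bs4 package is installed


def inner_texts(html, outer_class, inner_class):
    """Return <p> texts of inner_class divs nested in the first outer_class div."""
    soup = BeautifulSoup(html, "html.parser")
    outer = soup.find("div", {"class": outer_class})
    if outer is None:          # find() returns None when nothing matches
        return []
    result = []
    for div in outer.find_all("div", {"class": inner_class}):
        if div.p is not None:  # skip divs that have no <p> child
            result.append(div.p.get_text())
    return result


# Hypothetical sample input: one inner div with a <p>, one without.
sample = '<div class="a"><div class="b"><p>x</p></div><div class="b"></div></div>'
found = inner_texts(sample, "a", "b")
missing = inner_texts(sample, "nope", "b")
print(found, missing)
```

Returning an empty list for a missing outer div keeps callers from having to check for None themselves; raising an exception instead would be an equally reasonable choice if a missing div should count as an error.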