Parsing an unnamed element in BeautifulSoup

Date: 2015-02-10 13:53:26

Tags: python-2.7 parsing beautifulsoup lxml

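I need to parse some text that sits between unnamed br elements and follows a span with a certain class name. In this example I need 0.36, which comes right after the label "DS".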

Here is what I have tried.

from bs4 import BeautifulSoup  
html="""
<pre5 style="">                                                    
    <br><br>
    <span class="field-name">DS :</span>                               
    0.36 [null]<br><br> <br> <span> <b>FC</b> </span><span> : 0.0 </span><br>  <br> <span> <b>FDC</b> </span><span> : 0.36 </span><br>  <br> <span> <b>LDD</b> </span><span> : 4838400000 </span><br>  <br> <span> <b>IFS</b> </span><span> : 0.5333333 </span><br> 
  </pre5>
"""
soup = BeautifulSoup(html,'lxml')
divTag = soup.find_all("pre5", {"style":""})

for tag in divTag:
    tdTags = tag.find_all("span", {"class":"field-name"})
    for tag in tdTags:
        print tag.text 
        # print DS :, but I want 0.36


#Alternatively,
soup = BeautifulSoup(html,'lxml')
print str(soup.span.next_sibling.strip()).replace('[null]','')
#prints 0.36, but I would like to make sure that this value actually belongs to DS: rather than just taking the "immediate next sibling". Is there a way to respect the label DS and fetch the value for it?

Also, parsing, splitting, or replacing the raw string would be slower. Can I use the tree structure directly?

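Edit: in the case below, the value of DS should be 0.007. There is no guarantee that DS will be the first element, i.e. the one in the span that carries the class.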

html="""
<pre5 style="">                                                    
    <br><br>
    <span class="field-name">FC :</span>                               
    0.36 [null]<br><br> <br> <span> <b>DS:</b> </span><span> : 0.007 </span><br>  <br> <span> <b>FDC</b> </span><span> : 0.36 </span><br>  <br> <span> <b>LDD</b> </span><span> : 4838400000 </span><br>  <br> <span> <b>IFS</b> </span><span> : 0.5333333 </span><br> 
  </pre5>
"""

2 Answers:

Answer 0: (score: 2)

Since the text DS can be inside either a <span> or a <b> tag, and the data can also be inside a <span> tag, you can search for the tags like this:

html = """
<pre5 style="">
    <br><br>
    <span class="field-name">DS :</span>
    0.36 [null]<br><br> <br> <span> <b>FC</b> </span><span> : 0.0 </span><br>  <br> <span> <b>FDC</b> </span><span> : 0.36 </span><br>  <br> <span> <b>LDD</b> </span><span> : 4838400000 </span><br>  <br> <span> <b>IFS</b> </span><span> : 0.5333333 </span><br>
  </pre5>
<pre5 style="">
    <br><br>
    <span class="field-name">FC :</span>
    0.36 [null]<br><br> <br> <span> <b>DS:</b> </span><span> : 0.007 </span><br>  <br> <span> <b>FDC</b> </span><span> : 0.36 </span><br>  <br> <span> <b>LDD</b> </span><span> : 4838400000 </span><br>  <br> <span> <b>IFS</b> </span><span> : 0.5333333 </span><br>
  </pre5>
"""

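from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'lxml')
divTag = soup.find_all("pre5", {"style": ""})

for tag in divTag:
    # Match any <span> or <b> whose own text contains the label "DS:" or "DS :".
    labels = tag.find_all(["span", "b"], text=re.compile(r'DS\s*:'))
    for label in labels:
        if label.nextSibling.strip():
            # The value is the bare text node right after the label.
            print label.nextSibling.replace('[null]', '').strip()
        else:
            # Otherwise the value is in the next <span> in document order.
            print label.findNext("span").text.replace(':', '').strip()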

This gives you the output:

0.36
0.007
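
As a more defensive variant of the loop above, here is a minimal sketch that wraps the same idea in a function and checks the type of the next sibling before calling strip() on it. The helper name field_values, the shortened HTML, and the use of the built-in html.parser instead of lxml are assumptions made for illustration and are not part of the original answer. The single-argument print call runs on both Python 2.7 and Python 3.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the two layouts from the question (illustrative only).
HTML = """
<pre5 style=""><br><br>
<span class="field-name">DS :</span>
0.36 [null]<br><br> <br> <span> <b>FC</b> </span><span> : 0.0 </span><br>
</pre5>
<pre5 style=""><br><br>
<span class="field-name">FC :</span>
0.36 [null]<br><br> <br> <span> <b>DS:</b> </span><span> : 0.007 </span><br>
</pre5>
"""


def field_values(soup, label):
    """Hypothetical helper: values that follow every tag whose text is `label`."""
    # Anchored so that "DS" does not also match a longer label ending in "DS".
    pattern = re.compile(r'^\s*%s\s*:?\s*$' % re.escape(label))

    def is_label(tag):
        # tag.string is None when a tag has more than one child, so the
        # outer <span> wrapping " <b>DS:</b> " is skipped and only <b> matches.
        return (tag.name in ("span", "b")
                and tag.string is not None
                and pattern.match(tag.string) is not None)

    values = []
    for tag in soup.find_all(is_label):
        sibling = tag.next_sibling
        if isinstance(sibling, NavigableString) and sibling.strip():
            # Layout 1: the value is a bare text node right after the label.
            raw = sibling
        else:
            # Layout 2: the value is inside the next <span> in document order.
            following = tag.find_next("span")
            if following is None:
                continue
            raw = following.get_text()
        values.append(raw.replace("[null]", "").replace(":", "").strip())
    return values


soup = BeautifulSoup(HTML, "html.parser")
print(field_values(soup, "DS"))
```

Passing a function to find_all avoids depending on the text= keyword, and the anchored pattern keeps a lookup for one label from matching a longer label that merely contains it. Note that stripping every ":" would also damage values that legitimately contain a colon.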

Answer 1: (score: 0)

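If I understand correctly, the text you want to extract comes right after the <span> tag, so you can use next_element twice. The first call returns the text inside the tag and the second returns the text that follows the tag. It seems to work here, like this: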

from bs4 import BeautifulSoup  
html="""
<pre5 style="">                                                    
    <br><br>
    <span class="field-name">DS :</span>                               
    0.36 [null]<br><br> <br> <span> <b>FC</b> </span><span> : 0.0 </span><br>  <br> <span> <b>FDC</b> </span><span> : 0.36 </span><br>  <br> <span> <b>LDD</b> </span><span> : 4838400000 </span><br>  <br> <span> <b>IFS</b> </span><span> : 0.5333333 </span><br> 
  </pre5>
"""
soup = BeautifulSoup(html,'lxml')
divTag = soup.find_all("pre5", {"style":""})

for tag in divTag:
    tdTags = tag.find_all("span", {"class":"field-name"})
    for tag in tdTags:
        print tag.next_element.next_element.replace('[null]', '')

It yields (the surrounding whitespace can be stripped afterwards):

0.36
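
To make sure the value really belongs to DS, the same next_element walk can be guarded by a check on the span's own text. The following is only a sketch under assumptions: the helper name value_after_label, the shortened HTML, and the html.parser backend are introduced here for illustration, and it only covers the layout in which the label is the span with class "field-name".

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the first layout from the question (illustrative only).
HTML = """<pre5 style=""><br><br>
<span class="field-name">DS :</span>
0.36 [null]<br><br>
</pre5>"""


def value_after_label(soup, label):
    """Hypothetical helper: text that follows the field-name span for `label`."""
    for span in soup.find_all("span", class_="field-name"):
        # Skip spans whose text, minus the trailing " :", is not the label.
        if span.get_text().rstrip(" :").strip() != label:
            continue
        # next_element is the label's own text node; the element after that
        # is whatever follows the closing </span> in document order.
        node = span.next_element.next_element
        if isinstance(node, NavigableString):
            return node.replace("[null]", "").strip()
    return None


soup = BeautifulSoup(HTML, "html.parser")
print(value_after_label(soup, "DS"))
```

If the node after the label turns out to be a tag rather than a text node, the function returns None instead of guessing, so the caller can fall back to another lookup.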