Parsing specific data from HTML

Time: 2014-02-25 18:53:20

Tags: python html html-parsing beautifulsoup

I have already extracted the second table on its own. From that table I need to extract the rows that have a file name in column[0].

<TABLE WIDTH="100%" BORDER="1" >
<TR ><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="2" WIDTH="70%">Root</TD></TR>
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Functions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;10.1% (1077/10647)</TD></TR>
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Functions and exits</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;9.5% (2142/22473)</TD></TR>
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Statement blocks</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;9.1% (2191/24167)</TD></TR>
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Decisions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;8.8% (2648/29930)</TD></TR>
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Loops</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;8.4% (305/3628)</TD></TR>
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Basic conditions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;8.3% (1759/21254)</TD></TR>
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Modified conditions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;1.8% (35/1997)</TD></TR>
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Multiple conditions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;4.4% (137/3082)</TD></TR>

</TABLE>
</P>
<P ALIGN="LEFT"><BR>
2 - Files list</P>
<BR>
Display absolute values only.<BR>

<TABLE WIDTH="100%" BORDER="1" >
<TR BGCOLOR="#FFFF99"><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><b>Item<IMG SRC="cvi_sort_d.png" ALT="cvi_sort_d.xpm"></b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Functions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Functions and exits</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Statement blocks</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Decisions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Loops</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Basic conditions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Modified conditions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Multiple conditions</b></TD></TR>
<TR ><TD BGCOLOR="#FF9999" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><B><A NAME="175746848"></A><a href="LOADER.H.html">LOADER.H</a></B></TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P>
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/2</P>
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P>
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD></TR>
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746912"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoaderState_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD></TR>
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746976"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadParameters_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD></TR>
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747104"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadOffsets_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD></TR>
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747168"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadAppComponent_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD></TR>
<TR ><TD BGCOLOR="#FF9999" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><B><A NAME="175746848"></A><a href="CORBA_FIXED.CC.html">CORBA_FIXED.CC</a></B></TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P>
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/2</P>
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P>
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD></TR>
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746912"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoaderState_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD></TR>
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746976"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadParameters_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD></TR>
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747104"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadOffsets_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD></TR>
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747168"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadAppComponent_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P>
</TD></TR>
</TABLE>

To do this parsing, I wrote the following Python script:

from bs4 import BeautifulSoup
f = open("/home/vignesh/Downloads/html/RateDoc.html","r")
fl = {'LOADER.H','CORBA_FIXED.H'}
soup = BeautifulSoup(f)
t = soup.findAll('table')
for table in t[1:]:
    rows = table.findAll('tr')
    for tr in rows[1:]:
        cols = tr.findAll('td')
        for td in cols:
            text = ''.join((td.find(text=True)).encode('utf-8'))
            print text+"\t",
        print
    print


The above script extracts the data as follows:


LOADER.H    0/1 0/2 0/1 0/1 none    none    none    none    
        none    none    none    none    none    none    none    none    
        none    none    none    none    none    none    none    none    
        none    none    none    none    none    none    none    none    
        none    none    none    none    none    none    none    none    
CORBA_FIXED.CC  0/1 0/2 0/1 0/1 none    none    none    none    
        none    none    none    none    none    none    none    none    
        none    none    none    none    none    none    none    none    
        none    none    none    none    none    none    none    none    
        none    none    none    none    none    none    none    none 

But the expected result is shown below: I want to extract only the rows for files with the extension *.cc or *.h.

Required output:

LOADER.H    0/1 0/2 0/1 0/1 none    none    none    none    
CORBA_FIXED.CC  0/1 0/2 0/1 0/1 none    none    none    none    

Can someone help me modify the above script so that it extracts only the rows for the *.cc and *.h files?

2 answers:

Answer 0: (score: 0)

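If you wrap the print in an if, it should work. This is based on the fact that the output for the rows you want to skip seems to show a blank first entry followed by eight values of 'none'.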
if text == '':  # compare by value; `is` tests object identity
  break
else:
  print text + '\t',

Note that this is untested, because I am currently unable to run your code.
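The blank-first-cell idea can be sketched on its own, without the HTML parsing. This is only an illustration in Python 3: the `rows` list is hypothetical sample data standing in for the cell texts the question's loop collects, and `keep_named_rows` is a made-up helper name. It also assumes the "blank" first cell really contains non-breaking spaces (the `&#160;` entities in the HTML), so an exact comparison with `''` would likely not match and the text has to be stripped first.

```python
# Hypothetical sample data: one list of cell texts per table row.
# "\xa0" is the non-breaking space produced by the &#160; entities.
rows = [
    ["LOADER.H", "0/1", "0/2", "none "],
    [" \xa0\xa0\xa0", "none ", "none ", "none "],
    ["CORBA_FIXED.CC", "0/1", "0/2", "none "],
]


def keep_named_rows(rows):
    """Return only the rows whose first cell contains visible text."""
    kept = []
    for cells in rows:
        # In Python 3, str.strip() also removes non-breaking spaces.
        if cells and cells[0].strip():
            kept.append([cell.strip() for cell in cells])
    return kept


for cells in keep_named_rows(rows):
    print("\t".join(cells))
```

Skipping the whole row, rather than breaking out of the cell loop, also avoids printing an empty line for every rejected row.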

Answer 1: (score: 0)

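from bs4 import BeautifulSoup

INPUT = "/home/vignesh/Downloads/html/RateDoc.html"

def main():
    with open(INPUT, "rb") as inf:
        # Name the parser explicitly so bs4 does not have to guess one
        soup = BeautifulSoup(inf, "html.parser")

    for row in soup.find_all("tr"):
        first_col = row.find("td")
        if first_col is None:
            continue  # row without any cells
        # The first cell holds an empty <A NAME=...> anchor followed by the real link
        links = first_col.find_all("a")
        if len(links) == 2:
            link_text = links[1].text
            parts = link_text.rsplit(".", 1)
            # Keep the row only if the link text ends in .h or .cc (any case)
            if len(parts) > 1 and parts[-1].lower() in {"h", "cc"}:
                print("\t".join(cell.text.strip() for cell in row.find_all("td")))

if __name__ == "__main__":
    main()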

which produces

LOADER.H    0/1 0/2 0/1 0/1 none    none    none    none
CORBA_FIXED.CC  0/1 0/2 0/1 0/1 none    none    none    none
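
The extension test in this answer can also be pulled out into a small helper. The following is a minimal sketch, not part of the original answer: `is_wanted_file` and `WANTED_EXTENSIONS` are hypothetical names, and it swaps the `rsplit` check for the standard library's `os.path.splitext`, which returns the extension including its leading dot.

```python
import os.path

# Assumed set of extensions, taken from the question (*.h and *.cc).
WANTED_EXTENSIONS = {".h", ".cc"}


def is_wanted_file(name, extensions=WANTED_EXTENSIONS):
    """Return True if name ends in one of the wanted extensions, ignoring case."""
    _, ext = os.path.splitext(name.strip())
    return ext.lower() in extensions


for candidate in ["LOADER.H", "CORBA_FIXED.CC", "LoaderState_struct"]:
    print(candidate, is_wanted_file(candidate))
```

Keeping the extensions in one set makes it easy to add, say, ".cpp" later without touching the table-walking code.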