Perl HTML::TableExtract: getting the text without stray markup

Time: 2011-07-19 14:18:31

Tags: perl html-parsing

The table rows in my HTML look like this:

<TR bgcolor="#FFFFFF" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#FFFFFF';">
   <TD  class="dlfont">07/01/2011 10:33 AM EDT</B>&nbsp;</TD>
   <TD  class="dlfont">DRB</B>&nbsp;</TD><TD  class="dlfont">Blah</B>&nbsp;</TD>
   <TD  class="dlfont">PPD</B>&nbsp;</TD><TD  class="dlfont"> </B>&nbsp;</TD>
   <TD  class="dlfont">07/01/2011</B>&nbsp;</TD>
   <TD width=50 align=center><A HREF="javascript:parent.nav.details('0701201110:33AMEDTDRBPPD')"><IMG border='0' src='/images/view.gif' height=10 width=19></A></TD>
</TR>


<TR bgcolor="#EEEEEE" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#EEEEEE';">
    <TD  class="dlfont">07/01/2011 10:33 AM EDT</B>&nbsp;</TD>
    <TD  class="dlfont">WHPSF</B>&nbsp;</TD>
    <TD  class="dlfont">Blah</B>&nbsp;</TD>
    <TD  class="dlfont"> </B>&nbsp;</TD>
    <TD  class="dlfont"> </B>&nbsp;</TD>
    <TD  class="dlfont">07/01/2011</B>&nbsp;</TD>  
    <TD width=50 align=center><A HREF="javascript:parent.nav.details('0701201110:33AMEDTWHPSF')"><IMG border='0' src='/images/view.gif' height=10 width=19></A></TD>
</TR>

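When I extract the rows with HTML::TableExtract, the extra characters </B>&nbsp; also come through at the end of each cell and show up as some kind of special character. How can I get rid of them?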

1 answer:

Answer 0 (score: 1)

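When using HTML::TableExtract on malformed HTML like the markup in your question, I would keep two things in mind:

  1. Pass keep_html=>1 to the HTML::TableExtract constructor.
  2. Remove </B>&nbsp; with a regular expression, carefully.

Here is some Perl code I wrote to trim </B>&nbsp; from the table cells. Note that if you apply it blindly in every case, it can turn well-formed HTML into malformed HTML.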

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use HTML::TableExtract;
    
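    my ($f) = @ARGV;
    die "Usage: $0 file.html\n" unless defined $f;
    # Read the whole file; three-argument open with a lexical handle and an error check
    open my $fh, '<', $f or die "Cannot open $f: $!";
    my $html = join '', <$fh>;
    close $fh;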
    
    ### Your HTML didn't include headers, so I added a first table row whose
    ### td text is: time a b c d e f. This helps HTML::TableExtract find the
    ### table in the file $f.
    my $te = HTML::TableExtract->new(
        keep_html=>1,
        headers=>[qw/ time a b c d e f/]);
    
    $te->parse($html);
    
    for my $ts($te->tables)
    {
        print "Table(", join(',', $ts->coords), "):\n";
        for my $row ($ts->rows)
        {
            for my $cell (@$row)
            {
                # Skip undefined or empty cells (but keep a cell whose text is "0")
                next unless defined $cell && length $cell;
                # Maybe anchor the regex with $ or add another test here so that valid <B>...</B>&nbsp; cells are not affected
                $cell =~ s/<\/B>&nbsp;//i;
                print $cell."\n";
            }
        }
    }
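
The comment in the loop above suggests anchoring the regex or adding another test so that valid <B>...</B>&nbsp; cells are left alone. Here is a minimal sketch of one way to do that, using only core Perl. The helper name clean_cell is hypothetical and is not part of HTML::TableExtract. It assumes the cell still holds raw HTML (as with keep_html=>1) and that a closing </B> is stray only when the cell has no opening <B> tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableExtract): remove a stray
# trailing </B>&nbsp; from a cell's raw HTML, but only when the cell has
# no opening <B> tag that the </B> could belong to.
sub clean_cell {
    my ($cell) = @_;
    return $cell unless defined $cell;

    # Leave the cell alone if it contains an opening <B> or <B attr=...> tag.
    # The \b keeps tags such as <BR> or <BODY> from counting as <B>.
    return $cell if $cell =~ /<B\b[^>]*>/i;

    # \z anchors the match at the very end of the string.
    $cell =~ s{</B>&nbsp;\s*\z}{}i;
    return $cell;
}

# The stray closing tag and entity are removed.
print clean_cell('07/01/2011 10:33 AM EDT</B>&nbsp;'), "\n";   # prints: 07/01/2011 10:33 AM EDT

# A well-formed bold cell is left untouched.
print clean_cell('<B>DRB</B>&nbsp;'), "\n";                    # prints: <B>DRB</B>&nbsp;
```

In the loop above, the substitution line could then be replaced by $cell = clean_cell($cell); whether that check is strict enough depends on what else the real table cells contain.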