My table rows in the HTML look like this:
<TR bgcolor="#FFFFFF" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#FFFFFF';">
<TD class="dlfont">07/01/2011 10:33 AM EDT</B> </TD>
<TD class="dlfont">DRB</B> </TD><TD class="dlfont">Blah</B> </TD>
<TD class="dlfont">PPD</B> </TD><TD class="dlfont"> </B> </TD>
<TD class="dlfont">07/01/2011</B> </TD>
<TD width=50 align=center><A HREF="javascript:parent.nav.details('0701201110:33AMEDTDRBPPD')"><IMG border='0' src='/images/view.gif' height=10 width=19></A></TD>
</TR>
<TR bgcolor="#EEEEEE" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#EEEEEE';">
<TD class="dlfont">07/01/2011 10:33 AM EDT</B> </TD>
<TD class="dlfont">WHPSF</B> </TD>
<TD class="dlfont">Blah</B> </TD>
<TD class="dlfont"> </B> </TD>
<TD class="dlfont"> </B> </TD>
<TD class="dlfont">07/01/2011</B> </TD>
<TD width=50 align=center><A HREF="javascript:parent.nav.details('0701201110:33AMEDTWHPSF')"><IMG border='0' src='/images/view.gif' height=10 width=19></A></TD>
</TR>
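When I extract the rows with HTML::TableExtract, the extra </B> characters also show up at the end of each cell and turn into some kind of special character. How can I get rid of them?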
Answer 0: (score: 1)
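When using HTML::TableExtract with malformed HTML like the HTML in your question, I would keep two things in mind:
use keep_html=>1
trim the </B> yourself, carefully
Here is some Perl code I wrote to trim the </B> from the table cells. Note that if you apply it blindly in every case, it could turn well-formed HTML into malformed HTML.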
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
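my ($f) = @ARGV;
die "usage: $0 file.html\n" unless defined $f;
# read the whole file into a single string
open my $fh, '<', $f or die "Cannot open $f: $!";
my $html = join '', <$fh>;
close $fh;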
### your html didn't include headers, so I added a first table row with td text, time a b c d e f, to help HTML::TableExtract find the table in file, $f
my $te = HTML::TableExtract->new(
keep_html=>1,
headers=>[qw/ time a b c d e f/]);
$te->parse($html);
for my $ts($te->tables)
{
print "Table(",join(',',$ts->coords),"):\n";
for my $row ($ts->rows)
{
for my $cell (@$row)
{
## skip undefined or empty cells, but keep a cell whose text is "0"
next unless (defined $cell && length $cell);
## maybe add $ at end of regex or other test here to make sure valid cases of <B>...</B> are not affected
$cell =~ s/<\/B> //i;
print $cell."\n";
}
}
}
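The caution above about not touching valid <B>...</B> pairs could be handled with a small guard. The following is only a sketch under assumptions: `strip_stray_close_b` is a hypothetical helper name (it is not part of HTML::TableExtract), and the sample cells are assumed to resemble what keep_html=>1 returns for the rows in the question. It removes a trailing </B> only when the cell contains no opening <B> tag, and it is anchored to the end of the cell.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableExtract): remove a stray
# trailing </B> from a cell, but only if the cell has no opening <B> tag.
sub strip_stray_close_b {
    my ($cell) = @_;
    return $cell unless defined $cell;

    # A cell with an opening <B ...> tag is assumed to be well-formed; leave it alone.
    return $cell if $cell =~ /<B\b/i;

    # Drop the unmatched </B> and any whitespace around it at the end of the cell.
    $cell =~ s{\s*</B>\s*\z}{}i;
    return $cell;
}

# Sample cells modeled on the question's rows (assumed keep_html=>1 output).
my @cells = (
    '07/01/2011 10:33 AM EDT</B> ',
    'DRB</B> ',
    '<B>bold text</B>',
);
print "[", strip_stray_close_b($_), "]\n" for @cells;
```

In the loop above, the substitution line could then become `$cell = strip_stray_close_b($cell);`. The guard is deliberately simple: a cell that mixes a valid <B>...</B> pair with an extra stray </B> would be left unchanged, so it would still need a closer look.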