HTML table extraction - export to CSV, output breaks because some rows are formatted differently

Asked: 2014-02-09 22:57:30

Tags: perl bash sed awk grep

This code works well, but some rows don't follow the table layout. Occasionally there are rows like this:

<td colspan="7">
<div class="note">lots of notes here</div>
</td>

These rows break the CSV. Is there a way to ignore rows of this kind using the code below?

my $te = 'HTML::TableExtract'
     ->new(headers => ['Data1', 'Data 2', 'Data 3', 'Data 4',
                       'Data 4', 'Data 5', 'Data 6']);

my $csv = 'Text::CSV'->new({ binary       => 1,
                         eol          => "\n",
                         always_quote => 1,
                       });

while (@ARGV) {
  my $file = shift;
  open my $IN, '<', $file or die $!;
  my $html = do { local $/; <$IN> };
  $te->parse($html);
}
for my $table ($te->tables) {
  $csv->print(*STDOUT{IO}, $_) for $table->rows;
}

Here is a sample from the larger HTML file:

<table class="datalogs" cellspacing="5px">
<tr><th>Data1</th><th>Data 2</th><th>Data 3</th><th>Data 4</th><th>Data 4</th><th>Data 5</th><th>Data 6</th></tr>
<tr class="odd"><td valign="top"><h4>123<br/></h4></td><td valign="top">AAA</td><td valign="top"><b>url here</b></td><td valign="top">Yes</td><td valign="top">None</td><td valign="top"></td><td valign="top"></td></tr><tr class="even">...</td></tr>
<td colspan="7">
<div class="note">lots of notes here</div>
</td>
</table>

Here is the current sample output:

"Other","JPEG","http://URL/jpg/image.jpg","No","None",,
"Other","JPEG","http://URL/","Yes","None",,
"Other","PNG","http://URL:80/","Yes","None",,
"Othe","GIF","http://URL/GetData?y=1","No","None",,

Thanks for your help.

1 answer:

Answer 0 (score: 2)

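From the HTML::TableExtract documentation:

  Tables with rowspan or colspan attributes will have some cells containing undef.

Rather than trying to ignore the row while parsing, count how many of its 7 cells are defined. In the colspan note row only one cell holds a value and the rest are undef, so you can skip any row with exactly one defined cell.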

use strict;
use warnings;
use HTML::TableExtract;
use Text::CSV;

my $te = 'HTML::TableExtract'
     ->new(headers => ['Data1', 'Data 2', 'Data 3', 'Data 4',
                       'Data 4', 'Data 5', 'Data 6']);

my $csv = 'Text::CSV'->new({ binary       => 1,
                         eol          => "\n",
                         always_quote => 1,
                       });

while (@ARGV) {
  my $file = shift;
  open my $IN, '<', $file or die $!;
  my $html = do { local $/; <$IN> };
  $te->parse($html);
}

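for my $table ($te->tables) {
  for my $row ($table->rows) {
    # Cells covered by a colspan come back as undef; collect the defined ones.
    my @val = grep { defined $_ } @{$row};
    # A row with a single defined cell is the colspan note row: skip it.
    next if scalar(@val) == 1;
    $csv->print(*STDOUT{IO}, $row);
  }
}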
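
To try the filter on its own, without parsing any HTML, here is a minimal sketch that uses only core Perl. The `@rows` data and the `is_note_row` helper are hypothetical, made up for illustration. The sketch assumes the note row comes back as one defined cell followed by six undef cells, which is how the documentation quote above suggests a colspan row would look.

```perl
use strict;
use warnings;

# Hypothetical rows shaped like the array refs that $table->rows hands back:
# cells covered by a colspan (or empty cells) are undef.
my @rows = (
    [ 'Other', 'JPEG', 'http://URL/jpg/image.jpg', 'No',  'None', undef, undef ],
    [ 'lots of notes here', (undef) x 6 ],    # assumed shape of the colspan row
    [ 'Other', 'PNG',  'http://URL:80/',           'Yes', 'None', undef, undef ],
);

# Hypothetical helper: true when exactly one cell in the row is defined.
sub is_note_row {
    my ($row) = @_;
    my $defined = grep { defined $_ } @{$row};    # grep in scalar context counts
    return $defined == 1;
}

# Keep every row that is not a note row.
my @kept = grep { !is_note_row($_) } @rows;
print scalar(@kept), " data rows kept\n";
```

One caveat with this approach: a genuine data row that happens to have only one non-empty cell would be dropped as well, because empty cells also appear to come back as undef (the sample output shows unquoted empty fields even though always_quote is set). If that can happen in your data, a stricter test may be needed.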