I want to extract information from the following HTML block, using HTML::TableExtract in Perl to pull out specific columns by their headers.
<tr>
<th>Lane</th>
<th>Sample ID</th>
<th>Sample Ref</th>
<th>Index</th>
<th>Description</th>
<th>Control</th>
<th>Project</th>
<th>Yield (Mbases)</th>
<th>% PF</th>
<th># Reads</th>
<th>% of raw clusters per lane</th>
<th>% Perfect Index Reads</th>
<th>% One Mismatch Reads (Index)</th>
<th>% of >= Q30 Bases (PF)</th>
<th>Mean Quality Score (PF)</th>
</tr>
</table></div>
<div ID="ScrollableTableBodyDiv"><table width="100%">
<col width="4%">
<col width="5%">
<col width="19%">
<col width="8%">
<col width="7%">
<col width="5%">
<col width="12%">
<col width="7%">
<col width="4%">
<col width="5%">
<col width="4%">
<col width="5%">
<col width="6%">
<col width="5%">
<col>
<tr>
<td>1</td>
<td>test3_5_1</td>
<td></td>
<td>NoIndex</td>
<td></td>
<td></td>
<td>ABC</td>
<td>20,091</td>
<td>100.00</td>
<td>200,905,366</td>
<td>100.00</td>
<td>0.00</td>
<td>0.00</td>
<td>87.39</td>
<td>34.75</td>
</tr>
<tr>
<td>2</td>
<td>test5_1</td>
<td></td>
<td>NoIndex</td>
<td></td>
<td></td>
<td>ABC</td>
<td>10,280</td>
<td>100.00</td>
<td>102,799,692</td>
<td>100.00</td>
<td>0.00</td>
<td>0.00</td>
<td>89.60</td>
<td>35.57</td>
</tr>
so that I can output:
Lane Sample ID Sample Ref Index Description Control Yield (Mbases) % of >= Q30 Bases (PF)
1 test3_5_1 NoIndex 20,091 87.39
2 test5_1 NoIndex 10,280 89.6
The columns 'Sample Ref', 'Description', and 'Control' will be empty, but they must still be printed.
I tried something like this:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TableExtract;
my @header=("Lane", "Sample ID", "Sample Ref", "Index", "Description", "Control", " Yield (Mbases)", " % of >= Q30 Bases (PF)");
my $te= new HTML::TableExtract (depth=>0, count=>1, headers=> \@header );
$te->parse_file('testhtml.txt');
my $table = $te->first_table_found;
foreach my $table ( $te->tables ) {
foreach my $row ($table->rows) {
no warnings "uninitialized";
print " ", join("\t", @$row), "\n";
}
}
I can't get the desired output. Please help me improve my code. Thanks.
Answer 0 (score: 0)
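The culprit is most likely how the headers option is interpreted. From the documentation:
headers
Passed as an array reference, headers specify strings of interest at the top of columns within targeted tables. They can be either strings or regular expressions (qr//). If they are strings, they will eventually be passed through a non-anchored, case-insensitive regexp, so regexp special characters are allowed. (emphasis mine)
The parentheses in 'Yield (Mbases)' and '% of >= Q30 Bases (PF)' are therefore treated as regex grouping rather than as literal characters, so those two patterns do not match the header text.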
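One option is to use a distinctive substring for each column you want to extract. Another is to pass each string through quotemeta. Finally, you can use patterns such as qr/\QYield (Mbases)/ or qr/\Q% of >= Q30 Bases (PF)/. I find distinctive substrings more convenient. After all, people do sometimes change column headings, but there are usually some distinctive substrings that must be present regardless of presentation.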
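To see why the parentheses matter, here is a minimal sketch using only core Perl. It does not call HTML::TableExtract at all; header_matches is a hypothetical helper that imitates the unanchored, case-insensitive match the documentation describes, so treat it as an illustration of the regex behaviour rather than of the module's internals.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Header text exactly as it appears inside the <th> cells.
my @cells = ('Yield (Mbases)', '% of >= Q30 Bases (PF)');

# Hypothetical helper (not part of HTML::TableExtract): does $pattern,
# used as an unanchored, case-insensitive regex, match $text?
sub header_matches {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/i ? 1 : 0;
}

for my $cell (@cells) {
    # Raw string: "(...)" is a capture group, so the literal parentheses
    # in the cell text are never matched.
    my $raw = header_matches($cell, $cell);

    # quotemeta escapes every non-word character, so the parentheses
    # are matched literally.
    my $quoted = header_matches(quotemeta($cell), $cell);

    print "$cell: raw=$raw quotemeta=$quoted\n";
}
```

Both header strings report raw=0 and quotemeta=1, which is the same effect that \Q has inside qr//.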
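Since you did not provide a sample table body in your example, I used the keep_headers option to show some output. It also gives you the actual headers that were matched.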
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TableExtract;
my $te = HTML::TableExtract->new(
headers => [
'Lane',
'ID',
'Ref',
'Index',
'Description',
'Control',
'Yield',
'Q30 Bases',
],
keep_headers => 1,
);
$te->parse_file(\*DATA);
my @rows = $te->first_table_found->rows;
for my $row (@rows) {
print join("\n", map qq{'$_'}, @$row), "\n";
}
__DATA__
<div><table><tr>
<th>Lane</th>
<th>Sample ID</th>
<th>Sample Ref</th>
<th>Index</th>
<th>Description</th>
<th>Control</th>
<th>Project</th>
<th>Yield (Mbases)</th>
<th>% PF</th>
<th># Reads</th>
<th>% of raw clusters per lane</th>
<th>% Perfect Index Reads</th>
<th>% One Mismatch Reads (Index)</th>
<th>% of >= Q30 Bases (PF)</th>
<th>Mean Quality Score (PF)</th>
</tr>
</table></div>
Output:
$ ./jj.pl
'Lane'
'Sample ID'
'Sample Ref'
'Index'
'Description'
'Control'
'Yield (Mbases)'
'% of >= Q30 Bases (PF)'
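For the tab-separated layout the question asks for, where the empty columns still have to be printed, something along these lines might work. It is only a sketch: format_row is a hypothetical helper, and the rows are hard-coded stand-ins taken from the question's data. I am assuming the rows returned by the module are array references in which empty cells can come back as undef, which is what the "uninitialized" warnings in the question suggest.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: turn one row (an array reference) into a
# tab-separated line, printing undefined cells as empty strings so
# that empty columns keep their position.
sub format_row {
    my ($row) = @_;
    return join("\t", map { defined $_ ? $_ : '' } @$row);
}

# Stand-in rows shaped like the question's data. In a real script they
# would come from something like $te->first_table_found->rows.
my @rows = (
    [ 'Lane', 'Sample ID', 'Sample Ref', 'Index', 'Description',
      'Control', 'Yield (Mbases)', '% of >= Q30 Bases (PF)' ],
    [ 1, 'test3_5_1', undef, 'NoIndex', undef, undef, '20,091', '87.39' ],
    [ 2, 'test5_1',   undef, 'NoIndex', undef, undef, '10,280', '89.60' ],
);

print format_row($_), "\n" for @rows;
```

Mapping undef to '' explicitly avoids having to switch off the "uninitialized" warnings around the print.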