Parsing an HTML file with Perl

Date: 2015-09-07 05:52:46

Tags: html perl parsing

I am trying to extract a table from this HTML file using Perl.

I tried this:

my $te = HTML::TableExtract->new();
$te->parse_file($g_log);
print "=====TE: $te ======\n";

The output is:

HTML::TableExtract=HASH(0x266f5f)

I tried iterating over $te but found nothing. Can anyone tell me what to do next? I am new to this.

Here is the HTML file:

    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:math="http://exslt.org/math"
          xmlns:testng="http://testng.org">
       <head xmlns="">
          <title>TestNG Results</title>
          <meta http-equiv="content-type" content="text/html; charset=utf-8"></meta>
          <meta http-equiv="pragma" content="no-cache"></meta>
          <meta http-equiv="cache-control" content="max-age=0"></meta>
          <meta http-equiv="cache-control" content="no-cache"></meta>
          <meta http-equiv="cache-control" content="no-store"></meta>
          <LINK rel="stylesheet" href="style.css"></LINK>
          <script type="text/javascript" src="main.js"></script>
       </head>
       <body>
          <h2>Test suites overview</h2>


<table width="100%">
                 <tr>
                    <td align="center" id="chart-container"><script type="text/javascript">
                                            renderSvgEmbedTag(600, 200);
                                        </script></td>
                 </tr>
              </table>

   </body>
  </html>

2 answers:

Answer 0 (score: 2)

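#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

my $filename = "testfile.html";
my $te = HTML::TableExtract->new();
$te->parse_file($filename);

# Walk every table the parser found, then every row in each table.
foreach my $ts ($te->tables) {
   print "Table found at ", join(',', $ts->coords), ":\n";
   foreach my $row ($ts->rows) {
      # Some cells may be undef; print those as empty strings.
      print "   ", join(',', map { defined $_ ? $_ : '' } @$row), "\n";
   }
}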

Note that HTML::TableExtract can also be invoked in 'tree' mode, in which the resulting HTML and the extracted tables are encoded as HTML::Element tree structures.

use HTML::TableExtract 'tree';
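
To make the tree mode mentioned above more concrete, here is a minimal sketch. It assumes HTML::TreeBuilder and HTML::ElementTable are installed, which tree mode needs. The helper name `first_table_html` and the inline sample HTML are made up for illustration and are not from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract 'tree';   # switches the module into tree mode

# Hypothetical helper: returns the first table found, re-serialized as
# HTML from its element tree, or undef when the input has no table.
sub first_table_html {
    my ($html) = @_;
    my $te = HTML::TableExtract->new();
    $te->parse($html);
    $te->eof;                            # signal end of input to the parser
    my $table = $te->first_table_found;
    return undef unless $table;
    my $tree = $table->tree;             # element tree for this table
    return $tree->as_HTML;
}

# Made-up sample input, not the file from the question.
my $sample = '<html><body><table><tr><td>a</td><td>b</td></tr></table></body></html>';
my $out = first_table_html($sample);
print defined $out ? "$out\n" : "no table found\n";
```

Because the table is a tree of elements rather than plain text, it can be inspected or modified before being turned back into HTML.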

Answer 1 (score: 1)

I am not sure what you want to get out of the table, but I strongly recommend using Data::Dumper to look inside the hash.

#!/usr/bin/perl

use strict;
use warnings;
use HTML::TableExtract;
use Data::Dumper;

my $html = <<'EOT';
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:math="http://exslt.org/math"
          xmlns:testng="http://testng.org">
       <head xmlns="">
          <title>TestNG Results</title>
          <meta http-equiv="content-type" content="text/html; charset=utf-8"></meta>
          <meta http-equiv="pragma" content="no-cache"></meta>
          <meta http-equiv="cache-control" content="max-age=0"></meta>
          <meta http-equiv="cache-control" content="no-cache"></meta>
          <meta http-equiv="cache-control" content="no-store"></meta>
          <LINK rel="stylesheet" href="style.css"></LINK>
          <script type="text/javascript" src="main.js"></script>
       </head>
       <body>
          <h2>Test suites overview</h2>


<table width="100%">
                 <tr>
                    <td align="center" id="chart-container"><script type="text/javascript">
                                            renderSvgEmbedTag(600, 200);
                                        </script></td>
                 </tr>
              </table>

    </table>
   </body>
  </html>
EOT

my $te = HTML::TableExtract->new();
$te->parse($html);

print Dumper($te);
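
Dumping the whole parser object produces a lot of internal state. A narrower dump of just the extracted rows may be easier to read. This is a minimal sketch that assumes the same `tables` and `rows` methods used in the other answer. The helper name `rows_by_table` and the inline sample HTML are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use Data::Dumper;

# Hypothetical helper: returns an array ref holding, for each table
# found, an array ref of its rows (each row is an array ref of cells).
sub rows_by_table {
    my ($html) = @_;
    my $te = HTML::TableExtract->new();
    $te->parse($html);
    $te->eof;    # flush anything the parser is still buffering
    return [ map { [ $_->rows ] } $te->tables ];
}

# Made-up sample input, not the file from the question.
my $sample = '<table><tr><td>a</td><td>b</td></tr></table>';

$Data::Dumper::Indent = 1;    # more compact nesting in the dump
print Dumper(rows_by_table($sample));
```

The result is a plain nested array, so individual cells can be reached with ordinary indexing such as `$r->[0][0][1]` (first table, first row, second cell).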