Question

I have a file that I fetched with wget.

casperadm@casper:~> cat /tmp/one
<html>
<head>
<style>
a{text-decoration:none}
a:link{color:024C7E}
a:visited{color:024C7E}
a:active{color:958600}
body{font:10pt verdana;text-align:justify}
</style>
</head>
<body>
<pre>
x
-----
casper foo text
</body>
</html>

Then I built a very simple HTML parser in Perl:

#!/usr/bin/perl -w
use warnings ;
use strict;

package HTMLStrip;
use base "HTML::Parser";

  sub text {
     my ($self, $text) = @_;
     print $text;
  }

  my $p = new HTMLStrip;
  # parse line-by-line, rather than the whole file at once
  while (<>) {
      $p->parse($_);
  }
 # flush and parse remaining unparsed HTML
  $p->eof;

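Parsing works, but the contents of the <style> block are not skipped: the CSS rules are printed along with the page text. I did not expect that, and it corrupts the data I have to collect from an old web page. Any ideas on how to get rid of the inline CSS styles?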

casperadm@casper:~> /tmp/pleaseParse /tmp/one
a{text-decoration:none}a:link{color:024C7E}a:visited{color:024C7E}a:active{color:958600}body{font:10pt verdana;text-align:justify}
x
-----
casper foo text

Answer 1

Use HTML::TreeBuilder from the HTML::Tree distribution:

#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;
my $parser = HTML::TreeBuilder->new()
    or die "can't create parser\n";

my $root = $parser->parse_file(\*DATA)
    or die "can't parse HTML\n";

#$root->dump();
my $style = $root->look_down(_tag => 'style')
    or die "can't find <style>!\n";

$style->dump();

# IMPORTANT: call delete() on the tree once you are done with it if your program keeps running!
$root->delete();

exit 0;

__DATA__
<html>
<head>
<style>
a{text-decoration:none}
a:link{color:024C7E}
a:visited{color:024C7E}
a:active{color:958600}
body{font:10pt verdana;text-align:justify}
</style>
</head>
<body>
<pre>
x
-----
casper foo text
</body>
</html>

Output:

$ perl dummy.pl
<style> @0.0.0
  "\x0aa{text-decoration:none}\x0aa:link{color:024C7E}\x0aa:visited{color:024..."

Use the HTML::Element methods to manipulate the DOM node that $style points to.
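
As a rough illustration of that last step, the sketch below deletes every <style> node and then prints the text that remains. It only uses look_down(), delete() and as_text() from HTML::Element. The helper name strip_style_text() and the inline sample document are assumptions made for this example and are not part of HTML::Tree. How whitespace comes out depends on the TreeBuilder settings, so treat the printed text as approximate.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper (not part of HTML::Tree): returns the text of an
# HTML document with the contents of all <style> elements removed.
sub strip_style_text {
    my ($html) = @_;

    my $root = HTML::TreeBuilder->new;
    $root->parse($html);
    $root->eof;

    # In list context look_down() returns every matching element;
    # delete() detaches each one, together with its CSS text, from the tree.
    $_->delete for $root->look_down(_tag => 'style');

    # Concatenate the text content that is left in the tree.
    my $text = $root->as_text;

    # Free the tree explicitly, as in the answer above.
    $root->delete;

    return $text;
}

# Sample input modelled on /tmp/one from the question.
my $html = <<'END_HTML';
<html>
<head>
<style>
a{text-decoration:none}
body{font:10pt verdana;text-align:justify}
</style>
</head>
<body>
<pre>
x
-----
casper foo text
</body>
</html>
END_HTML

print strip_style_text($html), "\n";
```

To process the file from the question instead, read /tmp/one into a string and pass it to the helper.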
