Parsing text out of HTML markup

Date: 2014-09-09 20:17:50

Tags: html perl parsing tags

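I'm using a Perl program to extract the text from a batch of .htm files and store every unique ten-word sequence as a key in a hash. The end result is a hash in which each unique ten-word sequence is a key and its value is the number of times that sequence occurs across all of the files.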

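My problem is that the code keeps extracting the HTML tags along with the text, despite several attempts to strip the HTML with modules such as HTML::Parser. The code below produces no error messages, but it doesn't remove the HTML tags either. Any insights?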

#!/usr/bin/perl
use strict;
use warnings;

package MyParser;
use base qw(HTML::Parser);
my $p = HTML::Parser->new;

my $path = "U:/Perl/risk disclosures";
chdir($path) or die "Cant chdir to $path $!";

# This program counts the total number of unique six-grams in a 10-K and enumerates the frequency of each one.
# Starting off computing a simple word count for each word in the 10-K.

my @sequence;
my %sequences;
my $fh;

# Here creating an array of ten-grams.
my @files = <*.htm>;
foreach my $file (@files) {
    open( IFILE, $file );
    while (<IFILE>) {
        $p->parse($_);
        for (split) {
            push @sequence, $_;
            if ( @sequence >= 10 ) {
                shift @sequence until @sequence == 10;
                ++$sequences{"@sequence"};
            }
        }
    }
}
close(IFILE);

1 answer:

Answer 0 (score: 2)

Use Mojo::DOM to extract all of the text from an HTML document:

use strict;
use warnings;

use Mojo::DOM;

my $dom = Mojo::DOM->new(do {local $/; <DATA>});

my $text = $dom->all_text();

print $text;

__DATA__
<html>
<head>
<title>Hello World</title>
</head>
<body>
<h1>Header One</h1>
<p>Paragraph One, word one two three four five six seven eight nine <b>TEN</b> eleven
twelve thirteen fourteen.</p>
<p>Paragraph two, word one two three four five six seven eight nine <b>TEN</b> eleven
twelve thirteen fourteen fifteen</p>
</body>
</html>

Output:

Hello World Header One Paragraph One, word one two three four five six seven eight nine TEN eleven twelve thirteen fourteen. Paragraph two, word one two three four five six seven eight nine TEN eleven twelve thirteen fourteen fifteen

If you only want the text inside the body, use:

my $text = $dom->at('body')->all_text();

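Addendum on loading file contents

Mojo::DOM accepts a string of data. It currently has no interface for being passed a file handle.

You therefore have to load the file's contents yourself before instantiating the DOM object: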

#!/usr/bin/perl
# This program counts the total number of unique six-grams in a 10-K and enumerates the frequency of each one.
# Starting off computing a simple word count for each word in the 10-K.

use strict;
use warnings;
use autodie;

use Mojo::DOM;

my $path = "U:/Perl/risk disclosures";
chdir($path) or die "Can't chdir to $path: $!";

for my $file (<*.htm>) {
    my $data = do {
        open my $fh, '<', $file;
        local $/;    # Slurp mode
        <$fh>;
    };
    my $dom  = Mojo::DOM->new($data);
    my $text = $dom->all_text();

    # Further processing from here
    ...;
}
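
For the "further processing" step, one way to build the counts from $text is a sliding window over the words. The following is only a minimal sketch, assuming words are separated by whitespace. count_ngrams is a hypothetical helper name, not part of Mojo::DOM or the original code, and the short sample strings stand in for the text extracted from each file.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count every run of
# $n consecutive whitespace-separated words in $text, adding the tallies
# to the hash referenced by $counts.
sub count_ngrams {
    my ($text, $n, $counts) = @_;
    my @words = split ' ', $text;    # split ' ' also skips leading whitespace

    # Each $i is the start of one window; the range is empty when the
    # text has fewer than $n words.
    for my $i (0 .. @words - $n) {
        ++$counts->{ join ' ', @words[$i .. $i + $n - 1] };
    }
    return $counts;
}

# Sample strings standing in for the text of two files.
my %sequences;
count_ngrams('one two three four',  3, \%sequences);
count_ngrams('two three four five', 3, \%sequences);

printf "%d  %s\n", $sequences{$_}, $_ for sort keys %sequences;
```

Inside the loop over files, the call would be count_ngrams($text, 10, \%sequences), with %sequences declared before the loop. Calling it once per file also keeps a sequence from spanning the boundary between two files, which the single shared @sequence array in the question would allow.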