Splitting an HTML file with Perl

Time: 2016-06-08 07:05:44

Tags: perl

I am new to Perl, but I am trying to write a program that splits a single HTML file into multiple HTML files.

#!/usr/bin/env perl

use strict;
#use warnings;

my @file_names;

## Read the list of file names
open( my $fh, "$ARGV[0]" );
while ( <$fh> ) {
    chomp;    #remove new line character from the end of the line
    push @file_names, $_;
}

my $counter = 0;
my ( $file_name, $fn );

## Read the input file
open( $fh, "$ARGV[1]" );
while ( <$fh> ) {

    ## If this is an opening class, open the next output file,
    ## and set $counter to 1.

    if ( / class="bch_ha"/ ) {
        $counter   = 1;
        $file_name = shift(@file_names);
        open( $fn, ">", "$file_name" );

        #print "<html>\n<body>";
    }

    ## If this is a closing class, print the line and set $counter back to 0

    if ( /\n<p sourcepage="(\d+)" class="bch_ha"/ ) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }

    if ( / class="bcesu_tt"/ ) {
        $counter   = 1;
        $file_name = shift(@file_names);
        open( $fn, ">", "$file_name" );

        #print "<html>\n<body>";
    }

    if ( /\n<p sourcepage="(\d+)" class="bcekt_tt"/ ) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }

    if (/ class="bcekt_tt"/ ) {
        $counter   = 1;
        $file_name = shift(@file_names);
        open( $fn, ">", "$file_name" );

        #print "<html>\n<body>";
    }

    if ( /\n<p sourcepage="(\d+)" class="bcepq_tt"/ ) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }

    if ( / class="bcepq_tt"/ ) {
        $counter   = 1;
        $file_name = shift(@file_names);
        open( $fn, ">", "$file_name" );

        #print "<html>\n<body>";
    }

    if ( /\n<p sourcepage="(\d+)" class="bcecs_tt"/ ) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }

    if ( / class="bcecs_tt"/ ) {
        $counter   = 1;
        $file_name = shift(@file_names);
        open( $fn, ">", "$file_name" );

        #print "<html>\n<body>";
    }

    if ( /\n<p sourcepage="(\d+)" class="bceex_tt"/ ) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }

    if ( / class="bceex_tt"/ ) {
        $counter   = 1;
        $file_name = shift(@file_names);
        open( $fn, ">", "$file_name" );

        #print "<html>\n<body>";
    }

    if ( /\n<\/body>\n<\/html>/ ) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }

    ## Print into the corresponding file handle if $counter is 1

    print $fn $_ if $counter == 1
}

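I need to add more options. The code should prompt for the delimiter to be entered manually, and the split files should go into a folder named chapterxx. Please help.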

Yes, please find the HTML sample below.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="UTF-8" />
</head>
<body>
<p sourcepage="27" `class="bch_ha"`></p>
<p sourcepage="26"     class="bopob_ct">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</p>
<p sourcepage="26"     class="bopob_cr">Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%    <i>Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p>
<p sourcepage="26"     class="bch_nmword">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bch_nm">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bch_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="26" class="bopob_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%    <b>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX%    </b>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="26"     class="bopob_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p>
<p sourcepage="26" class="bopob_lbfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bch_ha">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26"     class="bopob_lblast">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b>    </p>
<p sourcepage="26"     class="bopcs_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="26"     class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="26"     class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%<span class="sup">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</sup>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="27" class="bch_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
</body>
</html>

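I only need to split the HTML from one class="bch_ha" to the next class="bch_ha" and write each part to a new HTML file, starting with reader_0.html. The file names should be incremental, e.g. reader_1.html.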

1 Answer:

Answer 0: (score: 0)

Perhaps this example will give you an idea of how to accomplish what you are planning.

The example focuses on how to split a file based on a delimiter.
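The core idea can be sketched on its own, without any file I/O: every line that contains the delimiter starts a new chunk, lines before the first delimiter are skipped, and processing stops at </body>. This is only a minimal sketch. `split_on_delim` is a hypothetical helper name, and it uses `index` for a literal substring match on the assumption that the delimiter is plain text rather than a regex.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Group lines into chunks; a new chunk starts at every line containing $delim.
# Lines before the first delimiter are skipped, and processing stops at </body>.
# split_on_delim is a hypothetical helper name, not part of any module.
sub split_on_delim {
    my ( $delim, @lines ) = @_;
    my @chunks;
    for my $line (@lines) {
        last if $line =~ m{</body>};
        if ( index( $line, $delim ) >= 0 ) {
            push @chunks, [];    # delimiter line opens a new chunk
        }
        push @{ $chunks[-1] }, $line if @chunks;
    }
    return @chunks;
}

my @lines = (
    qq{<body>\n},
    qq{<p class="bch_ha">one</p>\n},
    qq{<p class="bch_tt">two</p>\n},
    qq{<p class="bch_ha">three</p>\n},
    qq{</body>\n},
);

my @chunks = split_on_delim( 'class="bch_ha"', @lines );
printf "reader_%d.html gets %d line(s)\n", $_, scalar @{ $chunks[$_] }
    for 0 .. $#chunks;
```

With the sample lines above this prints that reader_0.html gets 2 line(s) and reader_1.html gets 1 line(s). Each chunk could then be written to its own file.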

Note: only the contents of the HTML body are saved; the output files have no <html>, <head>, or <body> wrapper.

#!/usr/bin/env perl
# test.pl

use strict;
use warnings;

my $file = './htmlInput.html'; # input file
my $delim = 'class="bch_ha"'; # delimiter
my $dir = 'chapter' . time; # folder with unix timestamp

# mkdir returns 1 if success
if ( mkdir($dir, 0755) ) {
    print "INFO: Created folder $dir to collect files.\n";  
} else {
    die "Can't make folder $dir: $!\n";
}

# reader_x.html, x = [0..]
my $reader = 'reader_0.html';

my $fh2;
my $cnt = 0;
my $delim_first_time = 1;
open(my $fh, "<", $file) or die "Can't open and read $file: $!"; # read file
while (my $li = <$fh>) {
    last if ( $li =~ /<\/body>/ ); # quit the while loop

    # \Q...\E makes the delimiter match literally, even if it contains regex metacharacters
    if ( $delim_first_time && $li =~ /\Q$delim\E/ ) {
        open($fh2, ">", "./$dir/$reader") or die "Can't write to $reader : $!"; # write
        $delim_first_time = 0;
    } elsif ( $li =~ /\Q$delim\E/ ) {
        close($fh2);
        $cnt++;
        $reader =~ s/[0-9]+/$cnt/; # reader_0.html -> reader_1.html
        open($fh2, ">", "./$dir/$reader") or die "Can't write to $reader : $!"; # write
    }
    print $fh2 $li if !$delim_first_time;
}
close($fh);
close($fh2) if $fh2; # $fh2 is undefined if the delimiter never matched

# output:
# [~]$ ./test.pl
# INFO: Created folder chapter1465642603 to collect files.
# [~]$ ls chapter1465642603
# reader_0.html  reader_1.html
# [~]$ cat chapter1465642603/reader_0.html
# <p sourcepage="27" `class="bch_ha"`></p>
# <p sourcepage="26"     class="bopob_ct">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</p>
# <p sourcepage="26"     class="bopob_cr">Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%    <i>Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p>
# <p sourcepage="26"     class="bch_nmword">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# <p sourcepage="26" class="bch_nm">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# <p sourcepage="26" class="bch_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="26" class="bopob_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%    <b>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX%    </b>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="26"     class="bopob_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p>
# <p sourcepage="26" class="bopob_lbfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# <p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# <p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# <p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# [~]$
# [~]$ cat chapter1465642603/reader_1.html
# <p sourcepage="26" class="bch_ha">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# <p sourcepage="26"     class="bopob_lblast">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b>    </p>
# <p sourcepage="26"     class="bopcs_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="26"     class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="26"     class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%<span class="sup">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</sup>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="27" class="bch_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# [~]$
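
The question also asks for the delimiter to be entered manually and for the output to go into a chapterxx folder, which the script above does not cover. The following is a rough sketch of those two pieces only. The helper names (`read_delimiter`, `chapter_dir`, `reader_path`) and the two-digit `chapter01` naming scheme are my own assumptions, since "xx" is not specified. For the demo an in-memory handle stands in for the keyboard and a temporary directory stands in for the working directory.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Read one line from the given handle and strip the trailing newline.
# Dies if nothing usable was entered. (Hypothetical helper.)
sub read_delimiter {
    my ($in) = @_;
    my $delim = <$in>;
    die "No delimiter given\n" unless defined $delim;
    chomp $delim;
    die "Delimiter must not be empty\n" if $delim eq '';
    return $delim;
}

# Build a folder name such as chapter01 from a chapter number (assumed scheme).
sub chapter_dir {
    my ($n) = @_;
    return sprintf 'chapter%02d', $n;
}

# Path of the n-th output file inside a chapter folder.
sub reader_path {
    my ( $dir, $n ) = @_;
    return File::Spec->catfile( $dir, "reader_$n.html" );
}

# Demo: an in-memory handle stands in for the keyboard here;
# for interactive use, pass \*STDIN to read_delimiter instead.
my $typed = qq{class="bch_ha"\n};
open my $in, '<', \$typed or die "Can't open in-memory handle: $!";
my $delim = read_delimiter($in);
close $in;

my $base = tempdir( CLEANUP => 1 );    # scratch area, removed on exit
my $dir  = File::Spec->catdir( $base, chapter_dir(1) );
make_path($dir);                       # no error if the folder already exists

print "Delimiter: $delim\n";
print "First output file: ", reader_path( $dir, 0 ), "\n";
```

In the script above, `$delim` and `$dir` could be set from these helpers instead of being hard-coded. Since the script matches the delimiter as a regex, typed input should be quoted with `\Q...\E` (or `quotemeta`) so it is taken literally.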