I am new to Perl, but I am trying to write a program that splits a single HTML file into multiple HTML files.
#!/usr/bin/env perl
use strict;
#use warnings;
my @file_names;
## Read the list of file names
open( my $fh, "$ARGV[0]" );
while ( <$fh> ) {
    chomp; #remove new line character from the end of the line
    push @file_names, $_;
}
my $counter = 0;
my ( $file_name, $fn );
## Read the input file
open( $fh, "$ARGV[1]" );
while ( <$fh> ) {
    ## If this is an opening class, open the next output file,
    ## and set $counter to 1.
    if ( / class="bch_ha"/ ) {
        $counter = 1;
        $file_name = shift(@file_names);
        open( $fn, ">", "$file_name" );
        #print "<html>\n<body>";
    }
    ## If this is a closing class, print the line and set $counter back to 0
    if ( /\n<p sourcepage="(\d+)" class="bch_ha"/ ) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }
    if ( / class="bcesu_tt"/ ) {
        $counter = 1;
        $file_name = shift(@file_names);
        open( $fn, ">", "$file_name" );
        #print "<html>\n<body>";
    }
    if ( /\n<p sourcepage="(\d+)" class="bcekt_tt"/ ) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }
    if (/ class="bcekt_tt"/ ) {
        $counter = 1;
        $file_name = shift(@file_names);
        open( $fn, ">", "$file_name" );
        #print "<html>\n<body>";
    }
    if ( /\n<p sourcepage="(\d+)" class="bcepq_tt"/ ) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }
    if ( / class="bcepq_tt"/ ) {
        $counter = 1;
        $file_name = shift(@file_names);
        open( $fn, ">", "$file_name" );
        #print "<html>\n<body>";
    }
    if ( /\n<p sourcepage="(\d+)" class="bcecs_tt"/ ) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }
    if ( / class="bcecs_tt"/ ) {
        $counter = 1;
        $file_name = shift(@file_names);
        open( $fn, ">", "$file_name" );
        #print "<html>\n<body>";
    }
    if ( /\n<p sourcepage="(\d+)" class="bceex_tt"/ ) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }
    if ( / class="bceex_tt"/ ) {
        $counter = 1;
        $file_name = shift(@file_names);
        open( $fn, ">", "$file_name" );
        #print "<html>\n<body>";
    }
    if ( /\n<\/body>\n<\/html>/ ) {
        $counter = 0;
        print $fn $_;
        close($fn);
    }
    ## Print into the corresponding file handle if $counter is 1
    print $fn $_ if $counter == 1
}
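I need to add more options. The code should ask for the delimiter to be entered manually, and the split files should go into a folder named chapterxx. Please help me.
Yes, please find the sample HTML below.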
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="UTF-8" />
</head>
<body>
<p sourcepage="27" `class="bch_ha"`></p>
<p sourcepage="26" class="bopob_ct">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</p>
<p sourcepage="26" class="bopob_cr">Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <i>Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p>
<p sourcepage="26" class="bch_nmword">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bch_nm">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bch_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="26" class="bopob_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <b>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX% </b>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="26" class="bopob_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p>
<p sourcepage="26" class="bopob_lbfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bch_ha">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
<p sourcepage="26" class="bopob_lblast">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b> </p>
<p sourcepage="26" class="bopcs_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="26" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="26" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%<span class="sup">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</sup>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
<p sourcepage="27" class="bch_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
</body>
</html>
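I only need to split the HTML from one class="bch_ha" up to the next class="bch_ha", and write that content into a new HTML file named reader_0.html. The file names should be incremental, e.g. reader_1.html.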
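Answer 0 (score: 0)
Maybe this example can give you an idea of how to accomplish your plan.
The focus of this example is on how to split a file based on a delimiter.
Note: only the HTML body is saved, from the first line that matches the delimiter up to (but not including) the closing </body> tag.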
#!/usr/bin/env perl
# test.pl
use strict;
use warnings;
my $file = './htmlInput.html'; # input file
my $delim = 'class="bch_ha"'; # delimiter
my $dir = 'chapter' . time; # folder with unix timestamp
# mkdir returns 1 if success
if ( mkdir($dir, 0755) ) {
    print "INFO: Created folder $dir to collect files.\n";
} else {
    die "Can't make folder $dir: $!\n";
}
# reader_x.html, x = [0..]
my $reader = 'reader_0.html';
my $fh2;
my $cnt = 0;
my $delim_first_time = 1;
open(my $fh, "<", $file) or die "Can't open and read $file: $!"; # read file
while (my $li = <$fh>) {
    last if ( $li =~ /<\/body>/ ); # quit the while loop
    # \Q...\E makes the delimiter match literally, not as a regex
    if ( $delim_first_time && $li =~ /\Q$delim\E/ ) {
        open($fh2, ">", "./$dir/$reader") or die "Can't write to $reader : $!"; # write
        $delim_first_time = 0;
    } elsif ( $li =~ /\Q$delim\E/ ) {
        close($fh2);
        $cnt++;
        $reader =~ s/[0-9]+/$cnt/; # reader_0.html -> reader_1.html
        open($fh2, ">", "./$dir/$reader") or die "Can't write to $reader : $!"; # write
    }
    print $fh2 $li if !$delim_first_time; # skip lines before the first delimiter
}
close($fh);
close($fh2) if defined $fh2; # $fh2 stays undefined if the delimiter never matched
# output:
# [~]$ ./test.pl
# INFO: Created folder chapter1465642603 to collect files.
# [~]$ ls chapter1465642603
# reader_0.html reader_1.html
# [~]$ cat chapter1465642603/reader_0.html
# <p sourcepage="27" `class="bch_ha"`></p>
# <p sourcepage="26" class="bopob_ct">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</p>
# <p sourcepage="26" class="bopob_cr">Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <i>Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p>
# <p sourcepage="26" class="bch_nmword">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# <p sourcepage="26" class="bch_nm">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# <p sourcepage="26" class="bch_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="26" class="bopob_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <b>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX% </b>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="26" class="bopob_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p>
# <p sourcepage="26" class="bopob_lbfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# <p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# <p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# <p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# [~]$
# [~]$ cat chapter1465642603/reader_1.html
# <p sourcepage="26" class="bch_ha">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p>
# <p sourcepage="26" class="bopob_lblast">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b> </p>
# <p sourcepage="26" class="bopcs_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="26" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="26" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%<span class="sup">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</sup>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# <p sourcepage="27" class="bch_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p>
# [~]$
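The question also asked for the delimiter to be entered by hand and for the split files to go into a folder named chapterxx. The following is a minimal sketch of one way to do that, not a definitive implementation. It assumes the input file is the first command-line argument and the chapter number an optional second one; `split_html` is a hypothetical helper name introduced here and is not part of the script above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: copy every line from the first delimiter match up to
# </body> into $dir/reader_N.html, starting a new file at each further match.
# Returns the number of files written.
sub split_html {
    my ($in_file, $delim, $dir) = @_;

    open(my $in, '<', $in_file) or die "Can't read $in_file: $!";
    my ($out, $count) = (undef, 0);

    while (my $line = <$in>) {
        last if $line =~ m{</body>};      # stop at the end of the body
        if ($line =~ /\Q$delim\E/) {      # \Q..\E: match the delimiter literally
            close($out) if $out;
            my $name = "$dir/reader_$count.html";
            open($out, '>', $name) or die "Can't write $name: $!";
            $count++;
        }
        print {$out} $line if $out;       # skip lines before the first match
    }

    close($out) if $out;
    close($in);
    return $count;
}

# Prompt only when an input file is given on the command line.
if (@ARGV) {
    my ($in_file, $chapter) = @ARGV;
    $chapter = 1 unless defined $chapter; # assumed default chapter number

    print "Delimiter to split on: ";
    my $delim = <STDIN>;
    die "No delimiter given\n" unless defined $delim;
    chomp $delim;
    die "No delimiter given\n" unless length $delim;

    my $dir = sprintf('chapter%02d', $chapter);   # e.g. chapter01
    -d $dir or mkdir($dir) or die "Can't make folder $dir: $!";

    my $n = split_html($in_file, $delim, $dir);
    print "Wrote $n file(s) to $dir\n";
}
```

Saved under an assumed name such as split.pl, it could be run as `./split.pl htmlInput.html 3` and answered with `class="bch_ha"` at the prompt, which would place the reader_N.html files in chapter03. Keeping the splitting logic in a sub separates it from the prompt, so the delimiter could just as well come from an argument or a config file.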