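I'm sure this is very basic, but I don't know Perl and only need to use it this once, so I appreciate your patience.
I'm trying to strip the unwanted text from the following line of HTML: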
<a target="_blank" href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List (<i>Revised<i>)</a>
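All I want to keep is Run Printable TCI List (<i>Revised</i>), which is the text just before the closing </a>. I have about 500 of these lines, and because they may change in the future, it makes sense to write a program. Here is my Perl code so far: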
open (SEARK, 'C:\\HTMLsorter\\sources.txt');
open (OUTSEARK, '>C:\\HTMLsorter\\outseark.txt');
while(<SEARK>) {
    chomp;
    if ($_=~/<a target/) {
        $_ =~ s/\<i>//g;
        $_ =~ s/\<\/i>//g;
        @itemsa = split(/>/);
        @itemsb = split(/</, $itemsa[1]);
        print OUTSEARK ("$itemsb[0]\n");
    }
}
close (SEARK);
close (OUTSEARK);
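I'm sure you can read this, but to explain: I'm opening a file called sources.txt, which contains the 500 lines that need processing. The output file is outseark.txt. So far it outputs: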
Run Printable TCI List (Revised)
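This is obviously because the substitutions and splits act on everything in and around the angle brackets. How do I keep the italics inside the parentheses, so that I'm left with: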
Run Printable TCI List (<i>Revised<i>)
Thanks for looking.
Answer 0 (score: 1)
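#!/usr/bin/perl
use strict;
use warnings;

# Three-argument open with lexical handles; stop if a file cannot be opened.
open my $ifh, '<', 'myfile.txt' or die "Cannot open myfile.txt: $!";
open my $ofh, '>', 'output.txt' or die "Cannot open output.txt: $!";

while (<$ifh>) {
    # Capture everything between the opening <a target ...> tag and </a>.
    if (/<a\s+target.*?>(.*?)<\/a>/i) {
        $_ = $1;
        # Strips every remaining tag, including <i>; drop this line to keep them.
        s/<.*?>//g;
        print $ofh "$_\n";
    }
}

close $ifh;
close $ofh;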
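Since the question asks to keep the <i> markup, here is a minimal sketch of the same capture without the tag-stripping step. The helper name link_inner_html and the shortened href in the sample line are made up for illustration, and the pattern assumes each link sits on a single line with no ">" inside its attribute values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the inner HTML of the first <a target=...> link
# in a string, or undef if there is none. Inner tags such as <i> are kept.
sub link_inner_html {
    my ($line) = @_;
    return $line =~ m{<a\s+target\b[^>]*>(.*?)</a>}is ? $1 : undef;
}

# Sample modelled on the question's line; the href is shortened (an assumption).
my $sample = '<a target="_blank" href="x.pdf">Run Printable TCI List (<i>Revised</i>)</a>';
my $inner  = link_inner_html($sample);
print "$inner\n" if defined $inner;
```

In a file-processing loop, the same helper could be called on each input line, printing the result only when it is defined.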
Answer 1 (score: 0)
You can do this in a one-liner.
cat inputfile|perl -ne 'if (s#<a\s+target[^>]+>(.+?)</a>##is){print "$1\n";}'>outputfile
Here it is in action:
echo '<a target="_blank" href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List (<i>Revised<i>)</a>
<a target="_blank" href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List 1(<i>Revised<i>)</a>
<a target="_blank" href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List 2(<i>Revised<i>)</a>
<a target="_blank" href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List 3(<i>Revised<i>)</a>'|\
perl -ne 'if (s#<a\s+target[^>]+>(.+?)</a>##is){print "$1\n";}'
Run Printable TCI List (<i>Revised<i>)
Run Printable TCI List 1(<i>Revised<i>)
Run Printable TCI List 2(<i>Revised<i>)
Run Printable TCI List 3(<i>Revised<i>)
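To make the pattern in the one-liner easier to read, the sketch below spells it out with the /x flag so each part can carry a comment. The names $link_re and extract_link_bodies, and the two short sample links, are illustrative assumptions rather than part of the answer.

```perl
use strict;
use warnings;

# The one-liner's pattern, written out with /x so each piece can be commented.
my $link_re = qr{
    <a \s+ target   # opening tag beginning with "<a target"
    [^>]+           # remaining attributes, up to the end of the opening tag
    >
    (.+?)           # capture the link body, as short as possible
    </a>            # closing tag
}xis;

# Hypothetical helper: return the body of every matching link in a string.
sub extract_link_bodies {
    my ($html) = @_;
    my @bodies = $html =~ /$link_re/g;
    return @bodies;
}

# Made-up sample input with two links.
my $html = <<'END';
<a target="_blank" href="a.pdf">First (<i>Revised</i>)</a>
<a target="_blank" href="b.pdf">Second</a>
END

print "$_\n" for extract_link_bodies($html);
```

Because the capture is non-greedy and the attribute part excludes ">", each match stops at the first closing </a>, which is why several links in one input are returned separately.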
Answer 2 (score: 0)
You should use a proper HTML parser, such as HTML::TreeBuilder. The code isn't complicated, as this program demonstrates:
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_file(*DATA);
print $_->as_text, "\n" for $tree->look_down(_tag => 'a', target => qr/./);
__DATA__
<a target="_blank" href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List (<i>Revised<i>)</a>
Output
Run Printable TCI List (Revised)
Edit
To use this technique with the files in your example, the code looks like this:
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_file('C:\HTMLsorter\sources.txt');
open my $out, '>', 'C:\HTMLsorter\outseark.txt' or die $!;
print $out $_->as_text, "\n" for $tree->look_down(_tag => 'a', target => qr/./);
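Edit 2
Now that I understand your requirement better, I can offer this alternative solution. It uses the HTML::DOM module to access the document object model of the HTML document, because getting the required result out of HTML::TreeBuilder is relatively awkward.
I also noticed that your sample HTML contains <i>Revised<i>, which should clearly be <i>Revised</i>, and I have corrected it for this test. In any case, the parser tries to handle bad HTML the way a browser would, so the output is usable even with the error in place.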
use strict;
use warnings;
use HTML::DOM;
my $dom = HTML::DOM->new;
$dom->parse_file('C:\HTMLsorter\sources.txt') or die $!;
open my $out, '>', 'C:\HTMLsorter\outseark.txt' or die $!;
print $out $_->innerHTML, "\n" for grep $_->attr('target'), $dom->getElementsByTagName('a');
Output
(with the tag corrected)
Run Printable TCI List (<i>Revised</i>)
(with the original tags)
Run Printable TCI List (<i>Revised<i>)</i></i>
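If the source file really does contain the unclosed <i>Revised<i> form, one option is to repair it before parsing or extracting. The following is only a sketch under that assumption: the helper name close_italics is invented, and the substitution only handles the simple case where plain text sits between two opening <i> tags.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite "<i>text<i>" as "<i>text</i>" when the second
# <i> stands where a closing tag belongs (no other tag in between).
sub close_italics {
    my ($html) = @_;
    $html =~ s{<i>([^<]*)<i>}{<i>$1</i>}g;
    return $html;
}

print close_italics('Run Printable TCI List (<i>Revised<i>)'), "\n";
```

Text that is already well formed, such as <i>Revised</i>, does not match the pattern and passes through unchanged.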