Question

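I have an interesting problem. I wrote the following Perl script to recursively walk a directory and, for the img/script/a tags in every HTML file, do the following:

Convert the entire URL to lowercase
Replace spaces and %20 with underscores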

The script works well unless an image tag is wrapped in an anchor tag. Is there a way to modify the current script so that it can also handle the links of nested tags that are not on separate lines? Basically, if I have <a href="..."><img src="..."></a>, the script only changes the link in the anchor tag and skips the img tag.

    #!/usr/bin/perl
    use File::Find;

    $input = "/var/www/tecnew/";

    sub process {
        # Only touch text files whose name ends in .htm or .html
        if (-T and m/.+\.(htm|html)/i) {
            #print "htm/html: $_\n";
            open(FILE, "+<$_") or die "couldn't open file $!\n";
            $out = '';
            while (<FILE>) {
                $cur_line = $_;
                # Anchor tags: clean the href value
                if ($cur_line =~ m/<a.*>/i) {
                    print "cur_line (unaltered) $cur_line\n";
                    $cur_line =~ /(^.* href=\")(.+?)(\".*$)/i;
                    $beg  = $1;
                    $link = html_clean($2);
                    $end  = $3;
                    $cur_line = $beg.$link.$end;
                    print "cur_line (altered) $cur_line\n";
                }
                # Image and script tags: clean the src value
                if ($cur_line =~ m/(<img.*>|<script.*>)/i) {
                    print "cur_line (unaltered) $cur_line\n";
                    $cur_line =~ /(^.* src=\")(.+?)(\".*$)/i;
                    $beg  = $1;
                    $link = html_clean($2);
                    $end  = $3;
                    $cur_line = $beg.$link.$end;
                    print "cur_line (altered) $cur_line\n";
                }
                $out .= $cur_line;
            }
            # Rewrite the file in place with the modified content
            seek(FILE, 0, 0) or die "can't seek to start of file: $!";
            print FILE $out or die "can't print to file: $!";
            truncate(FILE, tell(FILE)) or die "can't truncate file: $!";
            close(FILE) or die "can't close file: $!";
        }
    }

    find(\&process, $input);

    # Lowercase the link and replace spaces and %20 with underscores
    sub html_clean {
        my($input_string) = @_;
        $input_string = lc($input_string);
        $input_string =~ s/%20|\s/_/g;
        return $input_string;
    }

Answer 1

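Have you considered using a real parser instead of regular expressions? Regular expressions are not suitable for parsing HTML! Consider using a parser such as HTML::Parser.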
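As a rough illustration of that suggestion, here is a minimal sketch that lets HTML::Parser (a CPAN module that has to be installed separately) locate the start tags and then cleans the link attribute inside each tag's original text. The helper name rewrite_links and the %link_attr table are assumptions made for this example, and like the script in the question it only handles double-quoted attribute values.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

# Same cleanup as html_clean in the question.
sub html_clean {
    my ($url) = @_;
    $url = lc $url;
    $url =~ s/%20|\s/_/g;
    return $url;
}

# Which attribute carries the link for each tag of interest.
my %link_attr = (a => 'href', img => 'src', script => 'src');

# Hypothetical helper: returns a copy of $html with cleaned link attributes.
sub rewrite_links {
    my ($html) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start tags: edit the link attribute inside the original tag text.
        start_h => [
            sub {
                my ($tagname, $text) = @_;
                if (my $attr = $link_attr{$tagname}) {
                    $text =~ s/(\s$attr\s*=\s*")([^"]*)(")/$1 . html_clean($2) . $3/ie;
                }
                $out .= $text;
            },
            'tagname, text',
        ],
        # Everything else (text, end tags, comments, ...) is copied through.
        default_h => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

# A nested tag pair where the img tag is also split across two lines.
my $sample = qq{<p><a href="My%20Page.HTML"><img\n src="Pics/My Photo.JPG"></a></p>\n};
print rewrite_links($sample);
```

Because the parser reports each start tag on its own, it does not matter whether the img tag sits on the same line as the anchor, inside it, or is wrapped across several lines.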

Answer 2

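I would actually suggest reading the whole HTML text into memory and doing a multi-line search and replace with a regex that matches the entire tag: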

    $text =~ s{(<a\b[^>]*?\shref=")([^"]+)("[^>]*>.*?</a>)}{$1 . html_clean($2) . $3}gsie;
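To show how the slurp-and-replace idea could be extended to the nested img case from the question, here is a minimal sketch using only core Perl. The helper names clean_links and clean_file are assumptions for this example, only double-quoted attribute values are handled, and a regex approach like this can still be fooled by unusual markup (for instance a ">" inside an attribute value).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same cleanup as html_clean in the question.
sub html_clean {
    my ($url) = @_;
    $url = lc $url;
    $url =~ s/%20|\s/_/g;
    return $url;
}

# Hypothetical helper: clean href/src values anywhere in a whole document.
# [^>] also matches newlines, so tags wrapped across lines are covered.
sub clean_links {
    my ($text) = @_;
    # href on <a ...> tags
    $text =~ s{(<a\b[^>]*?\shref\s*=\s*")([^"]*)}{$1 . html_clean($2)}gie;
    # src on <img ...> and <script ...> tags, nested inside an anchor or not
    $text =~ s{(<(?:img|script)\b[^>]*?\ssrc\s*=\s*")([^"]*)}{$1 . html_clean($2)}gie;
    return $text;
}

# Hypothetical helper: slurp a file, clean its links, and write it back.
sub clean_file {
    my ($path) = @_;
    open my $in, '<', $path or die "can't read $path: $!";
    my $text = do { local $/; <$in> };    # undef $/ reads the whole file
    close $in;
    open my $out, '>', $path or die "can't write $path: $!";
    print {$out} clean_links($text);
    close $out or die "can't close $path: $!";
}

# Quick demonstration on an in-memory string.
print clean_links(qq{<a href="My%20Page.HTML"><img src="Pics/My Photo.JPG"></a>\n});
```

Each tag type gets its own substitution over the full text, so an anchor and the image it wraps are both rewritten even when they share a line. In the question's script, clean_file could be called on each matching file from the File::Find callback in place of the line-by-line loop.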

Editing hyperlinks in nested tags that are not on separate lines
