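I have about a thousand XML files in a folder. Each XML file contains about 100 items, and each item is on a single line.
I need to search and replace only within the text between
<content:encoded><![CDATA[
and
]]></content:encoded>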
I only need to make the following replacements:
' replaced with &apos;
" replaced with &quot;
< replaced with &lt;
> replaced with &gt;
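I have always used sed for bulk find/replace, but I don't know how to restrict the find/replace to the text between strings like these.
I'm happy to use whatever you think is best.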
Answer 0 (score: 0)
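Once the "start token" has been found, the solution needs to collect (match) everything up to, but not including, the "end token" - but matching the negation of a string is very hard. (See here and here for discussion.)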
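Below is a solution subject to some reasonable (I believe) constraints: the start and end tokens must not be nested, i.e. this is not allowed
[start] stuff [start] stuff [end] stuff [end]
; and neither the start nor the end token may be split across lines, i.e. this is not allowed either
hello world <content:enco
ded><![CDATA[ [stuff] ... etc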
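My solution is long, but it is liberally commented and straightforward rather than clever, which (arguably) makes it easier to maintain;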
use v5.12;
use warnings;

my $start_string = '<content:encoded><![CDATA[';
my $end_string   = ']]></content:encoded>';
my $start_token  = quotemeta $start_string;
my $end_token    = quotemeta $end_string;

# Escape the four characters; only ever called on CDATA content.
# &apos; is the XML predefined entity for the apostrophe.
sub do_subs {
    my $text = shift;
    $text =~ s/'/&apos;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

my $subs_mode = 0;    # "substitution mode" off/on

while (<>) {
    my $line_remnants = $_;    # what's left - initially, the whole line
    my $replacement   = "";    # replacement for whole line

    # while there's something left of the line to process
    # (ne, not "! ... eq": the latter stops early on a remnant of "0")
    while ( $line_remnants ne "" ) {
        if ($subs_mode) {
            # Currently substituting. Scan for end_token.
            # /s lets (.*) take the trailing newline, if there is one.
            if ( $line_remnants =~ /^ (.*?) $end_token (.*) /xs ) {
                # Found end_token -> do_subs on "preface" & add end_string
                $replacement .= do_subs($1) . $end_string;
                $line_remnants = $2;
                $subs_mode     = 0;
            }
            else {
                # Didn't find end_token -> do_subs on all of what's left
                $replacement .= do_subs($line_remnants);
                $line_remnants = "";
            }
        }
        else {
            # Currently NOT substituting. Scan for start_token
            if ( $line_remnants =~ /^ (.*?) $start_token (.*) /xs ) {
                # Found start_token -> append "preface" and start_string
                $replacement .= $1 . $start_string;
                $line_remnants = $2;
                $subs_mode     = 1;
            }
            else {
                # Didn't find start_token -> append all of what remains
                $replacement .= $line_remnants;
                $line_remnants = "";
            }
        }
    }    # while $line_remnants ne "" ...

    # Nothing left of line, print replacement
    print $replacement;
}
It is written in 'unix filter' style - it reads from STDIN, transforms, and writes to STDOUT. When fed this;
hello world
<content:encoded><![CDATA[ ' " ]]></content:encoded>
Here it comes: <content:encoded><![CDATA[ No quotes
like these in here ' " or relation ops like these < > ",>'
More non-allowed " ' <>'" - then the end: ]]></content:encoded>
these qotes should come through ' "<>
Start and End on one line - no data
<content:encoded><![CDATA[]]></content:encoded>
Start and End repeatedly on one line - single char
'<content:encoded><![CDATA[']]></content:encoded>'<content:encoded><![CDATA[']]></content:encoded>
... it produces;
hello world
<content:encoded><![CDATA[ &apos; &quot; ]]></content:encoded>
Here it comes: <content:encoded><![CDATA[ No quotes
like these in here &apos; &quot; or relation ops like these &lt; &gt; &quot;,&gt;&apos;
More non-allowed &quot; &apos; &lt;&gt;&apos;&quot; - then the end: ]]></content:encoded>
these qotes should come through ' "<>
Start and End on one line - no data
<content:encoded><![CDATA[]]></content:encoded>
Start and End repeatedly on one line - single char
'<content:encoded><![CDATA[&apos;]]></content:encoded>'<content:encoded><![CDATA[&apos;]]></content:encoded>
I hope it is of some use.