Question

I have about a thousand XML files in one folder. Each XML file contains about 100 items, and each item is on its own line.

I need to search and replace text only between

<content:encoded><![CDATA[

and

]]></content:encoded>

I only need to replace the following:

' replaced with &apos;
" replaced with &quot;
< replaced with &lt;
> replaced with &gt;

I have always used sed for bulk find/replace, but it does not work properly when I only want to find/replace between strings like these.

I am happy to use whatever you think is best.

Answer 1

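Once the "start token" has been found, the solution needs to collect (match) everything except the "end token", but scanning for the negation of a string is very difficult. (For discussion, see here and here.)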
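As an aside on "everything except the end token": when a whole region sits in one string, a non-greedy quantifier stops at the first end token instead of the last, which avoids negating a string. A minimal sketch of the difference, separate from the solution that follows; the sample line and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $start = quotemeta '<content:encoded><![CDATA[';
my $end   = quotemeta ']]></content:encoded>';

# Two marked regions on one line (illustrative sample data).
my $line = "a <content:encoded><![CDATA[one]]></content:encoded> b "
         . "<content:encoded><![CDATA[two]]></content:encoded> c";

# Greedy: .* runs on to the LAST end token, swallowing the text in between.
my @greedy = $line =~ /$start(.*)$end/g;

# Non-greedy: .*? stops at the FIRST end token after each start token.
my @lazy = $line =~ /$start(.*?)$end/g;

print scalar(@greedy), " greedy match\n";   # prints "1 greedy match"
print join(", ", @lazy), "\n";              # prints "one, two"
```

The greedy capture here is the single string running from "one" through "two", including the tokens between them, which is exactly the over-matching the paragraph above warns about.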

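Here is a solution subject to some (I believe reasonable) constraints:

Tokens cannot be nested, as in [start] stuff [start] stuff [end] stuff [end]; and
Neither the start nor the end token can be split across lines, i.e.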

hello world <content:enco

ded><![CDATA[ [stuff] ... etc
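The first constraint can be checked before any file is converted. The following is a rough sketch, assuming the text to check is already in a string; `check_tokens` is a hypothetical helper name and is not part of the solution below. It does not detect a token split across lines, since a split token is simply not recognized as a token.

```perl
use strict;
use warnings;

my $start_string = '<content:encoded><![CDATA[';
my $end_string   = ']]></content:encoded>';

# Returns 1 if start and end tokens strictly alternate (start, end, start,
# end, ...) in $text, and 0 if they are nested or unbalanced.
# check_tokens is a hypothetical helper name used only for this sketch.
sub check_tokens {
    my ($text) = @_;
    my $open = 0;                       # 1 while inside a marked region
    while ($text =~ /(\Q$start_string\E|\Q$end_string\E)/g) {
        if ($1 eq $start_string) {
            return 0 if $open;          # start inside a region: nested
            $open = 1;
        }
        else {
            return 0 unless $open;      # end without a matching start
            $open = 0;
        }
    }
    return $open ? 0 : 1;               # still open at the end: unbalanced
}

print check_tokens("x ${start_string}a${end_string} y"), "\n";              # prints 1
print check_tokens("${start_string}a${start_string}b${end_string}"), "\n";  # prints 0
```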

My solution is long, but it is liberally commented and more straightforward than clever, which (arguably) makes it easier to maintain:

use v5.12;

my $start_string =  '<content:encoded><![CDATA[' ;
my $end_string   =  ']]></content:encoded>' ;
my $start_token  =  quotemeta $start_string ;
my $end_token    =  quotemeta $end_string ;

sub do_subs {
    my $text = shift ;
    $text =~ s/'/\&apos;/g ;
    $text =~ s/"/\&quot;/g ;
    $text =~ s/\</\&lt;/g ;
    $text =~ s/\>/\&gt;/g ;
    return $text ;
}

my $subs_mode = 0;                # "substitution mode" off/on
while (<>) {
    my $line_remnants = $_ ;      # what's left - initially, the whole line
    my $replacement = "" ;        # replacement for whole line

    # while there's something left of the line to process
    while ( $line_remnants ne "" )  {
        if ($subs_mode) {
            # Currently substituting.  Scan for end_token
            # (/s lets the final group keep the newline, if the line has one)
            if ($line_remnants =~ /^ (.*?) $end_token (.*) /xs)  {
                # Found end_token -> &do_subs on "preface" & add end_string
                $replacement .= do_subs($1) . $end_string ;
                $line_remnants = $2 ;
                $subs_mode = 0 ;
            }
            else {
                # Didn't find end_token -> &do_subs on all of what's left
                $replacement .= do_subs($line_remnants) ;
                $line_remnants = "" ;
            }
        }
        else {
            # Currently NOT substituting.  Scan for start_token
            # (/s lets the final group keep the newline, if the line has one)
            if ($line_remnants =~ /^ (.*?) $start_token (.*) /xs)  {
                # Found start_token -> append "preface" and start_string
                $replacement .= $1 . $start_string ;
                $line_remnants = $2 ;
                $subs_mode = 1 ;
            }
            else {
                # Didn't find start_token -> append all of what remains
                $replacement .= $line_remnants ;
                $line_remnants = "" ;
            }
        }
    } # while ! $line_remnants ...

    # Nothing left of line, print replacement
    print $replacement ;
}

It is in 'unix filter' style: it reads from STDIN, transforms, and writes to STDOUT. When fed this:

hello world
<content:encoded><![CDATA[ ' " ]]></content:encoded>
Here it comes: <content:encoded><![CDATA[ No quotes
like these in here ' " or relation ops like these < > ",>'
More non-allowed " ' <>'" - then the end: ]]></content:encoded>
these qotes should come through ' "<>
Start and End on one line - no data
<content:encoded><![CDATA[]]></content:encoded>
Start and End repeatedly on one line - single char
'<content:encoded><![CDATA[']]></content:encoded>'<content:encoded><![CDATA[']]></content:encoded>

... it produces:

hello world
<content:encoded><![CDATA[ &apos; &quot; ]]></content:encoded>
Here it comes: <content:encoded><![CDATA[ No quotes
like these in here &apos; &quot; or relation ops like these &lt; &gt; &quot;,&gt;&apos;
More non-allowed &quot; &apos; &lt;&gt;&apos;&quot; - then the end: ]]></content:encoded>
these qotes should come through ' "<>
Start and End on one line - no data
<content:encoded><![CDATA[]]></content:encoded>
Start and End repeatedly on one line - single char
'<content:encoded><![CDATA[&apos;]]></content:encoded>'<content:encoded><![CDATA[&apos;]]></content:encoded>
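The question mentions roughly a thousand files in one folder, while a filter handles one stream at a time. Below is a sketch of one way to apply the same four substitutions to every `.xml` file in a directory. It swaps in a different technique from the line-by-line loop above: each file is read whole and the regions are matched with a non-greedy regex, under the same no-nesting constraint. The names `convert_text` and `convert_dir` and the `.bak` backup suffix are assumptions for illustration, and the sketch assumes each file fits in memory. Unlike the filter, it leaves a region with no end token untouched.

```perl
use strict;
use warnings;

my $start_string = '<content:encoded><![CDATA[';
my $end_string   = ']]></content:encoded>';

# Same four replacements as do_subs in the script above.
sub do_subs {
    my ($text) = @_;
    $text =~ s/'/&apos;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Apply do_subs only between start and end tokens (hypothetical helper).
# /s lets . match newlines, so a region may span several lines.
sub convert_text {
    my ($text) = @_;
    $text =~ s/\Q$start_string\E(.*?)\Q$end_string\E/$start_string . do_subs($1) . $end_string/gse;
    return $text;
}

# Rewrite every *.xml file in $dir, keeping each original as FILE.bak.
sub convert_dir {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @files = sort grep { /\.xml\z/ && -f "$dir/$_" } readdir $dh;
    closedir $dh;

    for my $name (@files) {
        my $file = "$dir/$name";
        open my $in, '<', $file or die "Cannot read $file: $!";
        my $text = do { local $/; <$in> };   # slurp the whole file
        close $in;

        rename $file, "$file.bak" or die "Cannot back up $file: $!";
        open my $out, '>', $file or die "Cannot write $file: $!";
        print {$out} convert_text($text);
        close $out or die "Cannot close $file: $!";
    }
}

# Only runs when a directory is given on the command line.
convert_dir($ARGV[0]) if @ARGV;
```

Saved under a name of your choosing, it would be run with the folder as its only argument. Trying it on a copy of the folder first is a sensible precaution, even with the `.bak` files.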

I hope it is of some use.
