Find and replace specific text between two strings?

Asked: 2016-04-13 00:41:52

Tags: bash perl shell command-line sed

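I have about a thousand XML files in one folder. Each XML file contains about 100 items, and each item is on a single line.

I need to search and replace only the text between

<content:encoded><![CDATA[

and

]]></content:encoded>

I only need to replace the following:

  • ' replaced with &apos;
  • " replaced with &quot;
  • < replaced with &lt;
  • > replaced with &gt;

I have always used sed for bulk find/replace, but it does not work properly when I only want to find/replace between two strings like these.

I am happy to use whatever tool you think is best.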

1 Answer:

Answer 0: (score: 0)

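Once the "start token" has been found, a solution needs to collect (match) everything up to, but not including, the "end token". Matching the negation of a whole string, however, is quite difficult with regular expressions. (For discussion, see here and here.)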

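Below is a solution that is subject to some (I believe) reasonable constraints:

  1. Tokens cannot be nested, e.g. [start] stuff [start] stuff [end] stuff [end]; and
  2. Neither the start token nor the end token may be split across lines, e.g.

    hello world <content:enco

    ded><![CDATA[ [stuff] ... etc

  3. My solution is long, but it is liberally commented, and because it is straightforward rather than clever it is (arguably) easier to maintain: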

    use v5.12;                        # also enables "use strict"
    use warnings;
    
    my $start_string =  '<content:encoded><![CDATA[' ;
    my $end_string   =  ']]></content:encoded>' ;
    my $start_token  =  quotemeta $start_string ;
    my $end_token    =  quotemeta $end_string ;
    
    sub do_subs {
        my $text = shift ;
        $text =~ s/'/\&apos;/g ;
        $text =~ s/"/\&quot;/g ;
        $text =~ s/\</\&lt;/g ;
        $text =~ s/\>/\&gt;/g ;
        return $text ;
    }
    
    my $subs_mode = 0;                # "substitution mode" off/on
    while (<>) {
        my $line_remnants = $_ ;      # what's left - initially, the whole line
        my $replacement = "" ;        # replacement for whole line
    
        # while there's something left of the line to process
            # ("ne" rather than "! ... eq": "!" binds tighter than "eq")
            while ( $line_remnants ne "" )  {
            if ($subs_mode) {
                # Currently substituting.  Scan for end_token
                    # /s lets (.*) take the trailing newline, if the line has one
                    if ($line_remnants =~ /^ (.*?) $end_token (.*) /xs)  {
                    # Found end_token -> &do_subs on "preface" & add end_string
                    $replacement .= do_subs($1) . $end_string ;
                    $line_remnants = $2 ;
                    $subs_mode = 0 ;
                }
                else {
                    # Didn't find end_token -> &do_subs on all of what's left
                    $replacement .= do_subs($line_remnants) ;
                    $line_remnants = "" ;
                }
            }
            else {
                # Currently NOT substituting.  Scan for start_token
                    # /s lets (.*) take the trailing newline, if the line has one
                    if ($line_remnants =~ /^ (.*?) $start_token (.*) /xs)  {
                    # Found start_token -> append "preface" and start_string
                    $replacement .= $1 . $start_string ;
                    $line_remnants = $2 ;
                    $subs_mode = 1 ;
                }
                else {
                    # Didn't find start_token -> append all of what remains
                    $replacement .= $line_remnants ;
                    $line_remnants = "" ;
                }
            }
        } # while ! $line_remnants ...
    
        # Nothing left of line, print replacement
        print $replacement ;
    }
    

    It is written in 'unix filter' style: it reads STDIN, transforms the text, and writes to STDOUT. When fed this:

    hello world
    <content:encoded><![CDATA[ ' " ]]></content:encoded>
    Here it comes: <content:encoded><![CDATA[ No quotes
    like these in here ' " or relation ops like these < > ",>'
    More non-allowed " ' <>'" - then the end: ]]></content:encoded>
    these qotes should come through ' "<>
    Start and End on one line - no data
    <content:encoded><![CDATA[]]></content:encoded>
    Start and End repeatedly on one line - single char
    '<content:encoded><![CDATA[']]></content:encoded>'<content:encoded><![CDATA[']]></content:encoded>
    

    ... it produces:

    hello world
    <content:encoded><![CDATA[ &apos; &quot; ]]></content:encoded>
    Here it comes: <content:encoded><![CDATA[ No quotes
    like these in here &apos; &quot; or relation ops like these &lt; &gt; &quot;,&gt;&apos;
    More non-allowed &quot; &apos; &lt;&gt;&apos;&quot; - then the end: ]]></content:encoded>
    these qotes should come through ' "<>
    Start and End on one line - no data
    <content:encoded><![CDATA[]]></content:encoded>
    Start and End repeatedly on one line - single char
    '<content:encoded><![CDATA[&apos;]]></content:encoded>'<content:encoded><![CDATA[&apos;]]></content:encoded>  
    

    I hope it is of some use.