Removing XML elements across multiple lines

Question

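I have a sed command that I want to run on a huge, horrible, ugly HTML file that was created from a Microsoft Word document. All it should do is remove every instance of the string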

style='text-align:center; color:blue;
exampleStyle:exampleValue'

The sed command I am trying to fix is

sed "s/ style='[^']*'//" fileA > fileB

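It works well, except that it fails to match whenever the text to be matched contains a newline. Is there a modifier for sed, or something else I can do, to force it to match any character, including newlines?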

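I know that regular expressions are terrible for XML and HTML, blah blah blah, but in this case the string pattern is well formed: the style attribute always starts with a single quote and ends with a single quote. So if I can solve the newline problem, I can cut the size of the HTML by more than 50% with a single command.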

In the end, it turned out that Sinan Ünür's perl script worked best. It was almost instantaneous, and it reduced the file size from 2.3 MB to 850k. Good ol' Perl...

Answer 1

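Sed reads its input line by line, so processing a match that spans several lines is not simple... but it is not impossible either: you need to use sed branching. The following works; I have commented it to explain what is going on (not the most readable syntax!):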

sed "# if the line matches 'style='', then branch to label, 
     # otherwise process next line
     /style='/b style
     b
     # the line contains 'style', try to do a replace
     : style
     s/ style='[^']*'//
     # if the replace worked, then process next line
     t
     # otherwise append the next line to the pattern space and try again.
     N
     b style
 " fileA > fileB
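
As a rough sketch of the same branching idea, the variant below wraps the loop in a shell function and runs it on a small sample. The function name strip_style and the sample input are made up for illustration. Unlike the script above, this variant adds the g flag and re-checks the pattern space after every substitution, so it is an adaptation rather than the answer's exact script.

```shell
# Hypothetical helper: delete single-quoted style attributes, pulling in
# further lines while an attribute has been opened but not yet closed.
strip_style() {
    sed -e ':retry' \
        -e "s/ style='[^']*'//g" \
        -e "/ style='/{" \
        -e '$b' \
        -e 'N' \
        -e 'b retry' \
        -e '}' "$@"
}

# Sample run: the first attribute wraps onto a second line.
printf "<p style='text-align:center;\ncolor:blue'>hi</p>\n<b style='x'>b</b>\n" | strip_style
```

The `$b` guard stops the loop on the last input line, so an attribute that is never closed does not make sed wait for more input.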

Answer 2

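sed goes over the input file line by line, which means that, as far as I understand it, what you want is not possible in sed.

You could use the following Perl script instead (untested):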

#!/usr/bin/perl

use strict;
use warnings;

{
    local $/; # slurp mode
    my $html = <>;
    $html =~ s/ style='[^']*'//g;
    print $html;
}

__END__

The one-liner would be:

$ perl -e 'local $/; $_ = <>; s/ style=\047[^\047]*\047//g; print' fileA > fileB

Answer 3

You can remove all CR/LF characters with tr, run sed, and then load the result into an editor that auto-formats it.
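
A minimal sketch of that pipeline, assuming it is acceptable for the whole document to end up on one line until it is reformatted. The function name flatten_and_strip is hypothetical, and the g flag is added so that every attribute on the single resulting line is removed.

```shell
# Hypothetical helper: drop every CR and LF, add one final newline so sed
# receives a complete line, then delete the style attributes.
flatten_and_strip() {
    { tr -d '\r\n'; echo; } | sed "s/ style='[^']*'//g"
}

# Sample run: two input lines come out as a single line.
printf "<p style='text-align:center;\ncolor:blue'>hi</p>\n<b style='x'>b</b>\n" | flatten_and_strip
```

Deleting the line breaks also joins words that were separated only by a newline, so translating them to spaces instead (tr '\r\n' '  ') may be the safer choice for running text.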

Answer 4

You can try this:

awk '/style/&&/exampleValue/{
    gsub(/style.*exampleValue\047/,"")
}
/style/&&!/exampleValue/{     
    gsub(/style.* /,"")
    f=1        
}
f &&/exampleValue/{  
  gsub(/.*exampleValue\047 /,"")
  f=0
}
1
' file

Output:

# more file
this is a line
    style='text-align:center; color:blue; exampleStyle:exampleValue'
this is a line
blah
blah
style='text-align:center; color:blue;
exampleStyle:exampleValue' blah blah....

# ./test.sh
this is a line

this is a line
blah
blah
blah blah....

Answer 5

Another way is:

$ cat toreplace.txt 
I want to make \
this into one line

I also want to \
merge this line

$ sed -e 'N;N;s/\\\n//g;P;D;' toreplace.txt

Output:

I want to make this into one line

I also want to merge this line

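N appends another line to the pattern space, P prints the pattern space up to the first newline, and D deletes the pattern space up to the first newline.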
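
The command above reads lines in fixed groups. As a sketch of a more general variant, the loop below keeps appending lines for as long as the pattern space ends in a backslash, so any number of continuation lines is joined. The function name join_continuations is hypothetical.

```shell
# Hypothetical helper: while the pattern space ends in a backslash, append
# the next line (N) and remove the backslash-newline pair between them.
join_continuations() {
    sed -e ':join' \
        -e '/\\$/{' \
        -e '$b' \
        -e 'N' \
        -e 's/\\\n//' \
        -e 'b join' \
        -e '}' "$@"
}

# Sample run: one two-line group and one three-line group.
printf 'I want to make \\\nthis into one line\n\nand \\\nthree \\\nlines\n' | join_continuations
```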

Answer 6

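Removing XML elements across multiple lines

My use case was pretty much the same, but I needed to match the opening and closing tags of an XML element and remove it completely, including whatever was inside.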

<xmlTag whatever="parameter that holds in the tag header">
    <whatever_is_inside/>
    <InWhicheverFormat>
        <AcrossSeveralLines/>
    </InWhicheverFormat>
</xmlTag>

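sed can only work on one line at a time, though. What we do here is trick it into appending the following lines to the current one, so that we can edit all the lines we need and then write out the result (\n is a legal character that you can emit from sed to separate lines again).

Inspired by @beano's answer and by another answer on Unix Stack Exchange, I built up this working sed "program":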

 sed -s --in-place=.back -e '/\(^[ ]*\)<xmlTag/{  # whenever you encounter the xmlTag
       $! {                                       # do
            :begin                                # label to return to
            N;                                    # append next line
            s/\(^[ ]*\)<\(xmlTag\)[^·]\+<\/\2>//; # Attempt substitution (elimination) of pattern
            t end                                 # if substitution succeeds, jump to :end
            b begin                               # unconditional jump to :begin to append yet another line
            :end                                  # label to mark the end
          }
       }'  myxmlfile.xml

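Some explanation:

I match <xmlTag without the closing > because my XML element carries parameters.
What precedes <xmlTag is a very handy piece of RegExp that matches any existing indentation: \(^[ ]*\), so that you can later output it again with just \1 (even though it was not needed this time).
The ; added in several places lets sed understand that the command (N, s or whichever) ends there and that the following characters are another command.
My biggest trouble was finding a RegExp that would match "anything in between". I finally settled on anything but · (that is, [^·]\+), counting on that character not appearing in any of the data files. I needed to escape the + as \+, because in GNU sed's basic regular expressions a bare + is a literal character and \+ is the "one or more" quantifier.
My original files are kept as .back in case something goes wrong (the tests still fail after the modification), and version control easily flags them for removal in bulk.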
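
For comparison, here is a sketch of a different technique from the N loop above: a sed address range that deletes every line from the one containing the opening tag through the one containing the closing tag. It assumes the opening and closing tags sit on separate lines, share those lines with nothing worth keeping, and are not nested inside an element of the same name. The function name drop_xml_element and its tag parameter are hypothetical.

```shell
# Hypothetical helper: $1 is the tag name, remaining arguments are files
# (stdin is read when none are given). Whole lines are deleted, from the
# opening-tag line to the closing-tag line inclusive.
drop_xml_element() {
    tag=$1
    shift
    sed "/<${tag}[ >]/,/<\/${tag}>/d" "$@"
}

# Sample run: the xmlTag element disappears, its siblings stay.
printf '<root>\n  <xmlTag a="b">\n    <inner/>\n  </xmlTag>\n  <keep/>\n</root>\n' | drop_xml_element xmlTag
```

This version writes to standard output and leaves the input untouched, so the result can be inspected before any file is overwritten.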

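I use this kind of sed automation to evolve the .XML files that hold the serialized data used to run our unit and integration tests. Whenever our classes change (losing or gaining fields), the data has to be updated. I do this with a single "find" that executes the sed automation on the files that contain the modified class. We have hundreds of XML data files.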
