Question

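I have a large XHTML file that contains a lot of junk text I don't need. I only need the text that appears between two specific strings, which occur many times in the file. For example: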

<html>
<xyz> unneeded text </xyz>
<mytag> important text1 </mytag>
<xyz> unneeded text </xyz>
<xyz> unneeded text </xyz>
<mytag> important text2 </mytag>
<mytag> important text3 </mytag>
<xyz> unneeded text </xyz>
</html>

My output should be:

important text1
important text2
important text3

I need to do this with a Bash script.

Thanks for your help.

Answer 1

Using regular expressions on XML is risky, particularly with a line-based text-processing tool such as grep: nothing guarantees that the result will be correct for every input.
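To make the risk concrete, here is a minimal sketch using a hypothetical input (not the asker's file) in which one <mytag> element spans two lines. The pattern is a deliberately simple one chosen for illustration; because grep tests each line on its own, the multi-line element is silently skipped.

```shell
# Hypothetical input: the second <mytag> element spans two lines.
input='<mytag> important text1 </mytag>
<mytag> important
text2 </mytag>'

# grep examines one line at a time, so only the single-line element matches.
matches=$(printf '%s\n' "$input" | grep -o '<mytag>.*</mytag>')
printf '%s\n' "$matches"
```

Only the first element is printed; "important text2" is lost without any error message.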

If your input is valid XML, I would go the XML way and use an XPath expression.

With the xmlstarlet tool, you can do:

xmlstarlet sel -t -v "//mytag/text()" file.xml

It gives the desired output.

You can also do it with xmllint, but you will need to do some further filtering on its output.
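One possible xmllint sketch is shown here. It assumes an xmllint build that supports the --xpath option, and rather than filtering the raw output afterwards it swaps in a different technique: it asks for each element by position and lets the XPath function normalize-space() strip the padding. The temporary file and the variable names are illustrative only.

```shell
# Sample input copied from the question, written to a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
<html>
<xyz> unneeded text </xyz>
<mytag> important text1 </mytag>
<xyz> unneeded text </xyz>
<xyz> unneeded text </xyz>
<mytag> important text2 </mytag>
<mytag> important text3 </mytag>
<xyz> unneeded text </xyz>
</html>
EOF

# count() reports how many <mytag> elements the document contains.
n=$(xmllint --xpath 'count(//mytag)' "$file")

# Query each element by position; normalize-space() trims leading and
# trailing whitespace and collapses internal runs of whitespace.
result=$(
  i=1
  while [ "$i" -le "${n:-0}" ]; do
    xmllint --xpath "normalize-space((//mytag)[$i])" "$file"
    i=$((i + 1))
  done
)
printf '%s\n' "$result"
rm -f "$file"
```

This runs xmllint once per element, so it is slower than a single query on very large files, but it does not depend on how a given xmllint version separates multiple text nodes in its output.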

Answer 2

Using an XML parser would be the best way to go.

Solution using grep with PCRE:

grep -Po '^<mytag>\s*\K.*?(?=\s*</mytag>$)'

Example:

$ cat file.xml                                    
<html>
<xyz> unneeded text </xyz>
<mytag> important text1 </mytag>
<xyz> unneeded text </xyz>
<xyz> unneeded text </xyz>
<mytag> important text2 </mytag>
<mytag> important text3 </mytag>
<xyz> unneeded text </xyz>
</html>

$ grep -Po '^<mytag>\s*\K.*?(?=\s*</mytag>$)' file.xml
important text1
important text2
important text3

Answer 3

Using an XML parser is the better approach, and Linux has command-line tools for XML parsing (e.g. xmllint), but you can also do it with grep like this:

$ cat data1 
<html>
<xyz> unneeded text </xyz>
<mytag> important text1 </mytag>
<xyz> unneeded text </xyz>
<xyz> unneeded text </xyz>
<mytag> important text2 </mytag>
<mytag> important text3 </mytag>
<xyz> unneeded text </xyz>
</html>
$ grep -oP '(?<=<mytag>).*(?=</mytag>)' data1
 important text1 
 important text2 
 important text3  
$

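The pattern (?<=<mytag>).*(?=</mytag>) extracts the text using a positive lookbehind, (?<=<mytag>), and a positive lookahead, (?=</mytag>), so the tags themselves are not part of the match. The spaces just inside the tags are kept, as the output above shows.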
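One caveat with this pattern is that .* is greedy. The sketch below uses a hypothetical one-line input (not taken from the question) with two <mytag> elements on the same line to show the difference between the greedy form and a lazy .*? variant, and it pipes the result through sed to trim the surrounding spaces. It assumes a grep built with PCRE support (the -P option).

```shell
# Hypothetical input: two <mytag> elements on a single line.
line='<mytag> a </mytag><xyz> junk </xyz><mytag> b </mytag>'

# Greedy .* runs to the LAST </mytag> on the line, so junk leaks into the match.
greedy=$(printf '%s\n' "$line" | grep -oP '(?<=<mytag>).*(?=</mytag>)')

# Lazy .*? stops at the FIRST </mytag>; sed then trims surrounding whitespace.
lazy=$(printf '%s\n' "$line" \
  | grep -oP '(?<=<mytag>).*?(?=</mytag>)' \
  | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')

printf 'greedy: %s\n' "$greedy"
printf 'lazy:\n%s\n' "$lazy"
```

With the sample file from the question each element sits on its own line, so both forms return the same three matches; the lazy form only matters when several elements share a line.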

Extract text between 2 specific strings that occur multiple times, in bash

3 answers: