Bash - Extract URLs from an XML file

Date: 2012-10-02 10:14:56

Tags: bash sed grep

I have this file (dev1.temp):

 <?xml version="1.0" encoding="UTF-8"?>
<krpano version="1.0.8.15" showerrors="false">

          <include url="include/sa/index.xml" /> <include url="content/sa.xml" />
          <include url="include/global/index.xml" />
          <include url="include/orientation/index.xml" />
          <include url="include/movecamera/index.xml" /> <include url="content/movecamera.xml" />
          <include url="include/fullscreen/index.xml" />
          <include url="include/instructions/index.xml" />
          <include url="include/coordfinder/index.xml" />
          <include url="include/editor_and_options/index.xml" />
</krpano>

The goal is to extract the value of every url attribute and write the values to a temporary file (devel.temp). The output should be:

include/sa/index.xml
content/sa.xml
include/global/index.xml
include/orientation/index.xml
include/movecamera/index.xml
content/movecamera.xml
include/fullscreen/index.xml
include/instructions/index.xml
include/coordfinder/index.xml
include/editor_and_options/index.xml

To do this, I have the following script:

# Make a temp file containing every url="..." attribute
grep -o 'url=['"'"'"][^"'"'"']*['"'"'"]' "$temp_folder/devel1.temp" > "$temp_folder/devel2.temp"
# Strip the leading url=" and the trailing quote, leaving just the URLs
sed -e 's/^url=["'"'"']//' -e 's/["'"'"']$//' "$temp_folder/devel2.temp" > "$temp_folder/devel.temp"

Yesterday it worked perfectly. Today, devel2.temp and devel.temp both contain this:

[01;31m[Kurl="include/sa/index.xml"[m[K
[01;31m[Kurl="content/sa.xml"[m[K
[01;31m[Kurl="include/global/index.xml"[m[K
[01;31m[Kurl="include/orientation/index.xml"[m[K
[01;31m[Kurl="include/movecamera/index.xml"[m[K
[01;31m[Kurl="content/movecamera.xml"[m[K
[01;31m[Kurl="include/fullscreen/index.xml"[m[K
[01;31m[Kurl="include/instructions/index.xml"[m[K
[01;31m[Kurl="include/coordfinder/index.xml"[m[K
[01;31m[Kurl="include/editor_and_options/index.xml"[m[K

Any ideas about what is going on?

4 Answers:

Answer 0: (score: 2)

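It seems grep is colouring its output with ANSI escape sequences even though the output is not a terminal. Change its --color setting from always to auto.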
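As a minimal sketch of that fix, the script from the question can pass --color=never explicitly, so that a --color=always picked up from the environment (for example through GREP_OPTIONS or an alias, which is an assumption about where it comes from here) no longer leaks escape sequences into the file. The sample XML below is an abbreviated copy of the one in the question, and the temporary directory is created only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: extract url attribute values with grep and sed, with colour forced off.
# temp_folder and the file names mirror the question; the XML is abbreviated.
temp_folder=$(mktemp -d)

cat > "$temp_folder/devel1.temp" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<krpano version="1.0.8.15" showerrors="false">
    <include url="include/sa/index.xml" /> <include url="content/sa.xml" />
    <include url="include/global/index.xml" />
</krpano>
EOF

# An explicit --color=never on the command line wins over an earlier --color=always.
# The pipe also makes the intermediate devel2.temp file unnecessary.
grep --color=never -o "url=[\"'][^\"']*[\"']" "$temp_folder/devel1.temp" |
    sed -e "s/^url=[\"']//" -e "s/[\"']\$//" > "$temp_folder/devel.temp"

cat "$temp_folder/devel.temp"
```

With --color=auto instead of --color=never, grep decides for itself and should only colour when writing to a terminal; never is simply the more defensive choice inside a script.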

You should also use an XML-aware tool rather than grep to process XML. For example, in xsh you can write

open file.xml ;
perl { use Term::ANSIColor } ;
for /krpano/include
    echo :s { color('bright_yellow') }
            @url
            { color('reset') } ;

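Answer 1: (score: 2)

In addition to choroba's comment about your ANSI sequences: I would avoid parsing XML with sed and similar tools where possible, and look to an XML-aware scripting tool instead. I use the XMLStarlet toolkit. That makes your script aware of character encodings and entities, and more robust when the XML changes.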
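The answer does not show a command, so the following is only a sketch of how XMLStarlet's sel subcommand could be used here. It assumes the binary is installed under the name xmlstarlet (some distributions install it as xml), and the file name and abbreviated XML are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: print the url attribute of every /krpano/include element with XMLStarlet.
# xml_file and its abbreviated content are illustrative placeholders.
xml_file=$(mktemp)

cat > "$xml_file" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<krpano version="1.0.8.15" showerrors="false">
    <include url="include/sa/index.xml" /> <include url="content/sa.xml" />
    <include url="include/global/index.xml" />
</krpano>
EOF

if command -v xmlstarlet >/dev/null 2>&1; then
    # -t starts a template, -m iterates over the matched nodes,
    # -v prints the value of an XPath expression, -n emits a newline.
    urls=$(xmlstarlet sel -t -m '/krpano/include' -v '@url' -n "$xml_file")
    printf '%s\n' "$urls"
else
    echo "xmlstarlet is not installed" >&2
fi
```

Because the tool parses the document, two include elements on the same line, or single versus double quotes around the attribute, need no special handling.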

Answer 2: (score: 2)

Consider using an XML-oriented tool such as xpath. I suggest this:

xpath -e "/krpano/include/@url" -q yourFile.xml | cut -f 2 -d "=" | sed 's/"//g'

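If you are sure that only the include elements under the krpano root carry a url attribute, you can also use the shorthand below, although the version above will be faster.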

xpath -e "//@url" -q yourFile.xml | cut -f 2 -d "=" | sed 's/"//g'

Answer 3: (score: 1)

A third XML-aware scripting tool is my Xidel:

xidel /tmp/your.xml -e //@url

(Unlike most such tools, it supports XPath 2.0, although that is overkill for this problem.)