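I have the following XML markup and am trying to convert it to a pipe-delimited file, but the multi-line content text is not being populated. Any help would be great.
Input XML: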
<entry>
<updated>2014-02-14T12:44:00-07:00</updated>
<id>943474234</id>
<title>sw eng</title>
<content type="text">this constantly crashes on 5S.
option volume is inconsistent. it shows something in the option list and something else when
getting the detail.
option should volume should not show in terms of 'K', 8.7K should be 8700.
the new version has many bugs and is frustrating.
:-(</content>
<im:voteSum>0</im:voteSum>
<im:voteCount>0</im:voteCount>
<im:rating>2</im:rating>
<im:version>3.3.0.122</im:version>
</entry>
Expected output:
2014-02-14T12:44:00-07:00|943474234|sw eng|this constantly crashes on 5S. option volume is inconsistent. it shows something in the option list and something else when getting the detail. option should volume should not show in terms of 'K', 8.7K should be 8700.the new version has many bugs and is frustrating.|0|0|2|3.3.0.122|
Answer 0 (score: 0)
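If you are using Python, BeautifulSoup would be a more convenient tool for this job. I don't think Bash is well suited to XML, but for this particular problem the following commands (GNU sed) may help.
tr -d '\n'
joins all lines into one. It deletes the newlines outright, so the last word of one line runs into the first word of the next.
sed -r 's/<[^<>]*>/|/g'
converts every XML tag to |.
sed -r 's/\|+/\|/g'
collapses runs of | into a single |.
sed -e 's/^|//' -e 's/|$/\n/'
removes the leading | and replaces the trailing | with a newline.
Removing the :-( is left to you. Assuming the input XML is in a file named xml_in, the full command chains the steps above:
tr -d '\n' < xml_in | sed -r 's/<[^<>]*>/|/g' | sed -r 's/\|+/\|/g' | sed -e 's/^|//' -e 's/|$/\n/'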
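Since this answer mentions Python, here is a minimal sketch of the same four steps using only the standard re module. It is a regex-based illustration, not a real XML parser, and the function name xml_to_pipe is made up for this example. Unlike tr -d '\n', it joins lines with a space so words do not run together, and it keeps the trailing | that the expected output shows.

```python
import re


def xml_to_pipe(xml_text: str) -> str:
    """Flatten one XML record into a pipe-delimited line (regex sketch)."""
    # Step 1: collapse all whitespace, including newlines, into single spaces.
    one_line = " ".join(xml_text.split())
    # Step 2: replace every tag with a pipe, like sed 's/<[^<>]*>/|/g'.
    piped = re.sub(r"<[^<>]*>", "|", one_line)
    # Step 3: collapse runs of pipes (and the spaces around them) into one.
    piped = re.sub(r"\s*\|[\s|]*", "|", piped)
    # Step 4: drop the leading pipe left by the opening root tag.
    return piped.lstrip("|")


if __name__ == "__main__":
    sample = (
        "<entry>\n<id>1</id>\n<title>a b</title>\n"
        '<content type="text">x\ny</content>\n</entry>'
    )
    print(xml_to_pipe(sample))  # prints: 1|a b|x y|
```

Like the sed version, this will misbehave if the text itself contains < or | characters.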
Answer 1 (score: 0)
How about this? You should be able to strip the leading and trailing | yourself.
sed 's/<[^>]*>/|/g' file |xargs |sed 's/| |/|/g'
|2014-02-14T12:44:00-07:00|943474234|sw eng|this constantly crashes on 5S. option volume is inconsistent. it shows something in the option list and something else when getting the detail. option should volume should not show in terms of K, 8.7K should be 8700. the new version has many bugs and is frustrating. :-(|0|0|2|3.3.0.122|
Note that xargs interprets quotes, which is why 'K' appears as K in this output.
Answer 2 (score: 0)
A robust solution based on an XML parser, xmlstarlet:
xml sel -B -t -m '/*/*' -v 'concat(normalize-space(text()),"|")' file 2>/dev/null
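sel is xmlstarlet's select (extract) command.
-B removes insignificant whitespace.
-t -m '/*/*' -v 'concat(normalize-space(text()),"|")' is the extraction template, which is translated internally into an XSLT document before being applied (add -C before -t to view that document).
-m '/*/*' matches all elements at the second level (here, the children of entry).
-v 'concat(normalize-space(text()),"|")' extracts a value from each matched element using XPath functions: text() is the text content of the matched node, normalize-space() normalizes its whitespace (trims the ends and collapses each run of whitespace, newlines included, into a single space), and concat() appends | to each value.
2>/dev/null suppresses the error messages caused by using the namespace prefix im: without declaring the corresponding namespace.
Getting xmlstarlet, with Homebrew (macOS) or apt-get (Debian/Ubuntu):
brew install xmlstarlet
sudo apt-get install xmlstarlet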
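For readers without xmlstarlet, the parser-based approach above can be roughly sketched with Python's standard xml.etree.ElementTree. This is an illustration under assumptions, not an equivalent of the xmlstarlet command: the function name entry_to_pipe is made up, and because the input never declares the im: prefix, the sketch wraps the record in a root element that binds im: to a placeholder URI (urn:example:im is hypothetical, not the real namespace). It also assumes the input has no XML declaration line.

```python
import xml.etree.ElementTree as ET


def entry_to_pipe(xml_text: str, ns_uri: str = "urn:example:im") -> str:
    """Join the text of each child of the record with a trailing pipe."""
    # Bind the undeclared im: prefix on a wrapper element; ns_uri is a
    # placeholder assumption, since the source does not give the real URI.
    wrapped = f'<root xmlns:im="{ns_uri}">{xml_text}</root>'
    entry = ET.fromstring(wrapped)[0]
    fields = []
    for child in entry:
        # " ".join(s.split()) mimics normalize-space(): trim the ends and
        # collapse each run of whitespace (newlines too) into one space.
        fields.append(" ".join((child.text or "").split()))
    # Mimic concat(value, "|"): every field is followed by a pipe.
    return "".join(field + "|" for field in fields)


if __name__ == "__main__":
    sample = (
        "<entry>\n<id>1</id>\n"
        '<content type="text">a\n  b</content>\n'
        "<im:rating>2</im:rating>\n</entry>"
    )
    print(entry_to_pipe(sample))  # prints: 1|a b|2|
```

Because a real parser handles the tags, text containing | or escaped characters such as &lt; does not confuse the field boundaries the way the regex approaches can.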