
Get valid URLs from a string using a Bash script

Time: 2013-12-10 08:17:11

Tags: xml bash shell

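I am parsing an XML file with xmllint. Each <item> contains a description element with CDATA text, from which I want to extract the title (the text up to the first <br />) and the URLs of a specific domain (desiredURL.com). I am not a regular-expression expert and have been using awk and sed. Is there a way to parse the data inside the description element with xmllint again, or what would be the appropriate approach? I want to iterate over all <item> elements and print the title and the URLs of the desired domain.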

#!/bin/bash
ITEMS=`echo "cat  //item/description/text()" | xmllint --shell  file.xml  | egrep '^\w'`
#iterate over items and print title and desiredURL


file.xml:

<item>
    <description><![CDATA[A title for the URLs<br /><br />

    http://www.foobar.com/foo/bar
    <br />http://bar.com/foo
    <br />http://myurl.com/foo
    <br />http://desiredURL.com/files/ddd
    <br />http://asdasd.com/onefile/g.html
    <br />http://second.com/link
    <br />]]></description> 



    </item>
<description> ...</description>
    <item>
</item>

1 answer:

Answer 0 (score: 1)

XMLlint

You can use the --xpath option to pass an XPath expression.

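Extracting the URLs

Assuming nothing follows the URL on each line, you can use grep with:

-P flag: Perl-compatible regular expressions (PCRE);
-o flag: print only the matched (non-empty) part of each line.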

Command

xmllint --xpath '//item/description' /tmp/so.xml | grep -Po 'http:.*'
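
The command above prints every URL. To also get the title and keep only the links for one domain, as the question asks, something along these lines might work. It is only a sketch: the function name extract_title_and_urls is made up for illustration, it assumes the title sits on the first line of the description text, and it uses grep -E rather than -P so that it does not depend on PCRE support.

```shell
#!/bin/bash
# Hypothetical helper (not part of xmllint or the original answer).
# Reads the text of ONE description on stdin and prints the title,
# then every URL whose host is the given domain (with or without www.).
extract_title_and_urls() {
    local domain="$1" text
    text=$(cat)

    # Escape dots so "desiredURL.com" is matched literally by the regex.
    local domain_re="${domain//./\\.}"

    # Title: first line, minus any leading CDATA wrapper, cut at the first <br />.
    printf '%s\n' "$text" |
        sed -n '1{s/.*<!\[CDATA\[//; s/[[:space:]]*<br[[:space:]]*\/>.*//; p;}'

    # URLs: -o prints only the match, -i ignores case in the host name.
    printf '%s\n' "$text" |
        grep -Eio "https?://(www\.)?${domain_re}/[^[:space:]<]*"
}
```

For a file holding a single <item>, it could be fed with xmllint --xpath '//item/description/text()' file.xml | extract_title_and_urls desiredURL.com. With several items the descriptions would first have to be split apart, since only the first input line is treated as a title.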