Question

I have an XML file that contains some links:

<SupportingDocs>
<LinkedFile>http://llcorp/ll/lljomet.dll/open/864606</LinkedFile>
<LinkedFile>http://llcorp/ll/lljomet.dll/open/1860632</LinkedFile>
<LinkedFile>%20http%3A%2F%2Fllenglish%2Fll%2Fll.exe%2Fopen%2F927515</LinkedFile>
<LinkedFile>%20http%3A%2F%2Fllenglish%2Fll%2Fll.exe%2Fopen%2F973783</LinkedFile>
</SupportingDocs>

I am using the regular expression "\<[^\<>]+>(?:https?://|www.)[^\<>]+\</[^\<>]+>" from C#, via var matches = MyParser.Matches(FormXml); it matches the first two links but not the encoded ones.

How can we match the URL-encoded links using RegEx?

Answer 1

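Here is a snippet that might help. I really question whether you are using the best approach, so I have made some assumptions (perhaps you haven't given enough detail).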

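I parse the XML into an XmlDocument so I can work with it in code, then pull out the relevant tags ("LinkedFile"). The text of each tag is parsed as a Uri. If that fails, the text is unescaped and parsing is attempted again. The end result is a list of strings containing the URLs that parsed correctly. If you really need to, you can run your regex over this collection.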

// this is for the interactive console
#r "System.Xml.Linq"
using System;                     // Uri, String
using System.Collections.Generic; // List<T>
using System.Xml;
using System.Xml.Linq;

// sample data, as provided in the post.
string rawXml = "<SupportingDocs><LinkedFile>http://llcorp/ll/lljomet.dll/open/864606</LinkedFile><LinkedFile>http://llcorp/ll/lljomet.dll/open/1860632</LinkedFile><LinkedFile>%20http%3A%2F%2Fllenglish%2Fll%2Fll.exe%2Fopen%2F927515</LinkedFile><LinkedFile>%20http%3A%2F%2Fllenglish%2Fll%2Fll.exe%2Fopen%2F973783</LinkedFile></SupportingDocs>";
var xdoc = new XmlDocument();
xdoc.LoadXml(rawXml);

// will store urls that parse correctly
var foundUrls = new List<String>();

// temp object used to parse urls
Uri uriResult;

foreach (XmlElement node in xdoc.GetElementsByTagName("LinkedFile"))
{
    var text = node.InnerText;

    // first parse attempt
    var result = Uri.TryCreate(text, UriKind.Absolute, out uriResult);

    // any valid Uri will parse here, so limit to http and https protocols
    // see https://stackoverflow.com/a/7581824/1462295
    if (result && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    {
        foundUrls.Add(uriResult.ToString());
    }
    else
    {
        // The above didn't parse, so check if this is an encoded string.
        // There might be leading/trailing whitespace, so fix that too
        result = Uri.TryCreate(Uri.UnescapeDataString(text).Trim(), UriKind.Absolute, out uriResult);

        // see comments above
        if (result && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
        {
            foundUrls.Add(uriResult.ToString());
        }
    }
}

// interactive output:
> foundUrls
List<string>(4) { "http://llcorp/ll/lljomet.dll/open/864606", "http://llcorp/ll/lljomet.dll/open/1860632", "http://llenglish/ll/ll.exe/open/927515", "http://llenglish/ll/ll.exe/open/973783" }
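
If a regex is still required on the raw XML, one option is to widen the pattern from the question so that it also accepts the percent-encoded scheme separator ("%3A%2F%2F") and an optional leading "%20", then decode whatever was captured. The following is only a minimal sketch under those assumptions: the class name EncodedLinkMatcher and the method FindLinks are hypothetical, and the pattern covers only the encodings that appear in the sample data, not every possible way a URL could be escaped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class EncodedLinkMatcher
{
    // Same shape as the question's pattern, with two additions:
    //  - an optional run of "%20" or whitespace before the scheme
    //  - "%3A%2F%2F" accepted as an alternative to "://"
    // Group 1 captures the element text. IgnoreCase also covers "%3a%2f%2f".
    private static readonly Regex LinkPattern = new Regex(
        @"<[^<>]+>((?:%20|\s)*(?:https?(?:://|%3A%2F%2F)|www\.)[^<>]+)</[^<>]+>",
        RegexOptions.IgnoreCase);

    public static List<string> FindLinks(string xml)
    {
        var links = new List<string>();
        foreach (Match match in LinkPattern.Matches(xml))
        {
            // Decode percent-escapes, then drop the space that "%20" turns into.
            links.Add(Uri.UnescapeDataString(match.Groups[1].Value).Trim());
        }
        return links;
    }
}
```

Calling EncodedLinkMatcher.FindLinks(rawXml) on the sample data should return the same four URLs as the XmlDocument approach. Unlike that approach, it does not validate the result with Uri.TryCreate, so anything that merely looks like a link will be included.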
