Question

My function needs to strip the <a> tags from a string when the text inside them looks like a URL. For example:

<a href=www.cnn.com>www.cnn.com</a>

is replaced with:

 www.cnn.com

That works fine, but when I have a string like:

<a href=www.cnn.com><span style="color: rgb(255, 0, 0);">www.cnn.com</span></a>

I only get:

www.cnn.com

when what I really want to keep is:

<span style="color: rgb(255, 0, 0);">www.cnn.com</span>

What do I need to add to my code to make this work?

Here is my function:

Dim ret As String = text

'If it looks like a URL
Dim regURL As New Regex("(www|\.org\b|\.com\b|http)")
'Gets a Tags regex
Dim rxgATags = New Regex("<[^>]*>", RegexOptions.IgnoreCase) 

'Gets all matches of <a></a> and adds them to a list
Dim matches As MatchCollection = Regex.Matches(ret, "<a\b[^>]*>(.*?)</a>") 

'for each <a></a> in the text check it's content, if it looks like URL then delete the <a></a>
For Each m In matches
'tmpText holds the data extracted within the a tags. /visit at.../www.applyhere.com
        Dim tmpText = rxgATags.Replace(m.ToString, "")

        If regURL.IsMatch(tmpText) Then
            ret = ret.Replace(m.ToString, tmpText)
        End If
Next

Return ret

Answer 1

The following Regex removes all HTML tags:

string someString = "<a href=www.one.co.il><span style=\"color: rgb(255, 0, 255);\">www.visitus.com</span></a>";

string target = System.Text.RegularExpressions.Regex.Replace(someString, @"<[^>]*>", "", RegexOptions.Compiled).ToString();

This is the regex you want: <[^>]*>

Result of my code: www.visitus.com

Answer 2

You can use the following regex - <a\s*[^<>]*>|</a> - it will match all <a> nodes, both the opening and the closing ones.
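As a minimal sketch of that first suggestion, the pattern can be passed to Regex.Replace so that only the anchor tags are dropped. The module name StripAnchorsDemo and the helper StripAnchors are hypothetical names chosen for illustration, and the sample string is the one from the question.

```vb
Imports System
Imports System.Text.RegularExpressions

Module StripAnchorsDemo
    ' Pattern quoted above: an opening <a ...> tag or a closing </a> tag.
    Private ReadOnly rxAnchors As New Regex("<a\s*[^<>]*>|</a>", RegexOptions.IgnoreCase)

    ' Hypothetical helper: removes the anchor tags and leaves everything else in place.
    Function StripAnchors(html As String) As String
        Return rxAnchors.Replace(html, "")
    End Function

    Sub Main()
        Dim html As String = "<a href=www.cnn.com><span style=""color: rgb(255, 0, 0);"">www.cnn.com</span></a>"
        ' Prints: <span style="color: rgb(255, 0, 0);">www.cnn.com</span>
        Console.WriteLine(StripAnchors(html))
    End Sub
End Module
```

Note that this sketch removes every anchor, whether or not it points at a URL. Also, because \s* can match nothing, <a\s*[^<>]*> would match an opening tag such as <abbr> as well; <a\b[^<>]*> is one possible tightening if that matters for your input.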

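You don't need regURL; it can be built into the rxgATags regex. We can make sure the <a> tag references a URL by checking that its href attribute starts with one of the regURL alternatives, then grab everything between the opening and closing tags and keep only that content.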

Dim ret As String = "<a href=www.one.co.il><span style=""color: rgb(255, 0, 255);"">www.visitus.com</span></a>"
'Gets a Tags regex
Dim rxgATags = New Regex("(<a\s*[^<>]*href=[""']?(?:www|\.org\b|\.com\b|http)[^<>]*>)((?>\s*<(?<t>[\w.-]+)[^<>]*?>[^<>]*?</\k<t>>\s*)+)(</a>)", RegexOptions.IgnoreCase)
Dim replacement As String = "$2"
ret = rxgATags.Replace(ret, replacement)
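
To see what the combined pattern does and does not match, here is a small sketch that runs it against two inputs from the question. The module name CombinedRegexDemo and the wrapper UnwrapUrlAnchors are hypothetical; the pattern itself is copied unchanged from the code above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module CombinedRegexDemo
    ' Same pattern as in the answer above.
    Private ReadOnly rxgATags As New Regex(
        "(<a\s*[^<>]*href=[""']?(?:www|\.org\b|\.com\b|http)[^<>]*>)((?>\s*<(?<t>[\w.-]+)[^<>]*?>[^<>]*?</\k<t>>\s*)+)(</a>)",
        RegexOptions.IgnoreCase)

    ' Hypothetical wrapper around the Replace call from the answer.
    Function UnwrapUrlAnchors(html As String) As String
        Return rxgATags.Replace(html, "$2")
    End Function

    Sub Main()
        ' Anchor wrapping a <span>: the anchor is dropped and the span is kept.
        Console.WriteLine(UnwrapUrlAnchors("<a href=www.cnn.com><span style=""color: rgb(255, 0, 0);"">www.cnn.com</span></a>"))

        ' Anchor wrapping bare text: group 2 requires at least one child element,
        ' so nothing matches and the input comes back unchanged.
        Console.WriteLine(UnwrapUrlAnchors("<a href=www.cnn.com>www.cnn.com</a>"))
    End Sub
End Module
```

Two things to keep in mind: this pattern tests the href value rather than the link text that regURL was applied to in the question, and anchors that contain only plain text still need the original handling.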


Answer 3

I added this to my code:

'Selects only the A tags without the data extracted between them
Dim rxgATagsOnly = New Regex("</?a\b[^>]*>", RegexOptions.IgnoreCase)

    For Each m In matches
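        'tmpText holds the text extracted from within the a tags, e.g. "visit at... www.applyhere.com"
        'rxgATagsContent is not declared in this snippet; it is presumably the rxgATags regex ("<[^>]*>") from the question
        Dim tmpText = rxgATagsContent.Replace(m.ToString, "")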

        'if the data extract between the tags looks like a URL then take off the a tags without touching the span tags.
        If regURL.IsMatch(tmpText) Then
            'select everything but a tags
            Dim noATagsStr As String = rxgATagsOnly.Replace(m.ToString, Environment.NewLine)
            'replaces string with a tag to non a tag string keeping it's span tags
            ret = ret.Replace(m.ToString, noATagsStr)

        End If
    Next

So from the string:

<a href=www.cnn.com><span style="color: rgb(255, 0, 0);">www.cnn.com</span></a>

I selected only the <a> tags using Avinash Raj's regex and then replaced them with "". Thanks, everyone, for answering.
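
Put together with the function from the question, the whole thing might look like the sketch below. A few assumptions are baked in: the module name UnwrapLinks and the function name RemoveUrlAnchors are made up for the example, rxgATagsContent is taken to be the question's rxgATags regex, and the anchor tags are replaced with an empty string (as described in the text) rather than with Environment.NewLine (as in the snippet).

```vb
Imports System
Imports System.Text.RegularExpressions

Module UnwrapLinks
    ' Text that looks like a URL (pattern from the question).
    Private ReadOnly regURL As New Regex("(www|\.org\b|\.com\b|http)")
    ' Any tag at all; used only to read the visible text of a link.
    Private ReadOnly rxgATags As New Regex("<[^>]*>", RegexOptions.IgnoreCase)
    ' Opening or closing <a> tags only.
    Private ReadOnly rxgATagsOnly As New Regex("</?a\b[^>]*>", RegexOptions.IgnoreCase)

    Function RemoveUrlAnchors(source As String) As String
        Dim ret As String = source

        ' Visit every <a>...</a> element in the input.
        For Each m As Match In Regex.Matches(source, "<a\b[^>]*>(.*?)</a>")
            ' Visible text of the link, with every tag stripped.
            Dim tmpText As String = rxgATags.Replace(m.Value, "")

            If regURL.IsMatch(tmpText) Then
                ' Drop only the <a> tags so nested tags such as <span> survive.
                Dim noATagsStr As String = rxgATagsOnly.Replace(m.Value, "")
                ret = ret.Replace(m.Value, noATagsStr)
            End If
        Next

        Return ret
    End Function

    Sub Main()
        ' Prints: <span style="color: rgb(255, 0, 0);">www.cnn.com</span>
        Console.WriteLine(RemoveUrlAnchors("<a href=www.cnn.com><span style=""color: rgb(255, 0, 0);"">www.cnn.com</span></a>"))
        ' Prints: www.cnn.com
        Console.WriteLine(RemoveUrlAnchors("<a href=www.cnn.com>www.cnn.com</a>"))
        ' Link text does not look like a URL, so the anchor is left alone.
        Console.WriteLine(RemoveUrlAnchors("<a href=/about>About us</a>"))
    End Sub
End Module
```

Links whose text does not match regURL are left untouched, which is the same behavior as the original function.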

vb.net regex - replace a tags without replacing span tags

3 answers: