Question

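I am trying to parse HTML text stored in Excel cells and remove some parts of it. The text can contain different span styles, URLs and classes. I think the simplest approach is a RegEx.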

I have six types of links (these are examples; the real ones may of course have different attributes and values):

2 without anchor text and without <img> (should be selected)

<a href="/"></a>
<a href="/"></a>

2 without anchor text and with <img> (should not be selected)

<a href="/" title=""><img class="cars"></a>
<a href="/" title=""><img class="cars"></a>

and 2 with anchor text (should not be selected)

<a href="/"><span style="color: #000000;">Cars</span></a>
<a href="/">Cars</a>

Which RegEx pattern should I use to match only the 2 links that have no anchor text and no <img>?

I have built this pattern:

<a href=".*">(?!<img ".*">)(?:<\/span>)?<\/a>

but it also matches the two types of links that contain <img> code:

<a href="/" title=""><img class="cars"></a>
<a href="/" title=""><img class="cars"></a>

However, if I remove the quotes inside the <img> tag:

<a href="/" title=""><img class=cars></a>

it works as expected.

VBA code:

Public Function txtrpl(ByRef x As String) As String

    With CreateObject("VBScript.RegExp")
        .Global = True
        .Pattern = "<a href="".*"">(?!<img "".*"">)(?:<\/span>)?<\/a>"
        txtrpl = Trim$(.Replace(x, ""))
    End With

End Function

Answer 1

If you would consider a solution that does not use a regex, you can use the HTMLDocument object.

You can add a reference (Microsoft HTML Object Library) in the VBE to get this library and then use early binding. Alternatively, as in my sample code below, just use late binding:

Dim objHtml As Object
Set objHtml = CreateObject("htmlfile")

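My example passes a string to create the HTMLDocument, and for that you need to use late binding, per the accepted answer at https://stackoverflow.com/questions/9995257/mshtml-createdocumentfromstring-instead-of-createdocumentfromurl.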

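Either way, you can use the methods and properties of the HTMLDocument object to inspect the DOM. Below I use getElementsByTagName, innerText and innerHTML to get the two tags you are interested in. For example: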

' we want a tags without anchors and without img
For Each objElement In objElements
    ' innerText = "" is no anchor
    If objElement.innerText = "" Then
        ' check for <img in innerHtml to avoid a tags with an image
        If InStr(1, objElement.innerHtml, "<IMG", vbTextCompare) = 0 Then
            Debug.Print objElement.outerHTML
        End If
    End If
Next objElement

Full example:

Option Explicit

Sub ParseATags()

    Dim strHtml As String

    strHtml = ""
    strHtml = strHtml & "<html>"
    strHtml = strHtml & "<body>"
    ' 2 without anchors and without <img>
    strHtml = strHtml & "<a href=""/""><span style=""color: #000000;""></span></a>"
    strHtml = strHtml & "<a href=""/""></a>"
    ' 2 without anchors and with <img>
    strHtml = strHtml & "<a href=""/"" title=""""><span style=""color: #000000;""></span><img class=""cars""></a>"
    strHtml = strHtml & "<a href=""/"" title=""""><img class=""cars""></a>"
    ' and 2 with anchors
    strHtml = strHtml & "<a href=""/""><span style=""color: #000000;"">Cars</span></a><br>"
    strHtml = strHtml & "<a href=""/"">Cars</a><br>"
    strHtml = strHtml & "</body>"
    strHtml = strHtml & "</html>"

    ' must use late binding
    ' https://stackoverflow.com/questions/9995257/mshtml-createdocumentfromstring-instead-of-createdocumentfromurl
    Dim objHtml As Object
    Set objHtml = CreateObject("htmlfile")

    ' add html
    With objHtml
        .Open
        .write strHtml
        .Close
    End With

    ' now parse the document
    Dim objElements As Object, objElement As Object

    ' get the <a> tags
    Set objElements = objHtml.getElementsByTagName("a")

    ' we want a tags without anchors and without img
    For Each objElement In objElements
        ' innerText = "" is no anchor
        If objElement.innerText = "" Then
            ' check for <img in innerHtml to avoid a tags with an image
            If InStr(1, objElement.innerHtml, "<IMG", vbTextCompare) = 0 Then
                Debug.Print objElement.outerHTML
            End If
        End If
    Next objElement

End Sub
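
The question is about deleting these links, not just printing them, so the same loop can be adapted to remove the matching nodes and return the cleaned markup. This is only a minimal sketch: RemoveEmptyAnchors is a hypothetical name, and the sketch assumes the cell holds an HTML fragment that can be wrapped in <html><body>. MSHTML re-serialises the markup, so tag case, attribute quoting and possibly relative URLs in the result may differ from the input. Check the output against your own data before relying on it.

```vba
Option Explicit

' Hypothetical helper (not part of the original answer): returns the fragment
' with every <a> that has no text and no <img> removed.
Public Function RemoveEmptyAnchors(ByVal strHtml As String) As String

    ' late binding, as in the example above
    Dim objHtml As Object
    Set objHtml = CreateObject("htmlfile")

    With objHtml
        .Open
        .write "<html><body>" & strHtml & "</body></html>"
        .Close
    End With

    Dim objElements As Object, objElement As Object
    Dim i As Long

    Set objElements = objHtml.getElementsByTagName("a")

    ' walk backwards: the collection is live and shrinks as nodes are removed
    For i = objElements.Length - 1 To 0 Step -1
        Set objElement = objElements.Item(i)
        ' no visible text means no anchor text
        If Trim$(objElement.innerText & "") = "" Then
            ' keep links that wrap an image
            If objElement.getElementsByTagName("img").Length = 0 Then
                objElement.parentNode.removeChild objElement
            End If
        End If
    Next i

    ' markup as re-serialised by MSHTML (may not be byte-identical to the input)
    RemoveEmptyAnchors = objHtml.body.innerHTML

End Function
```

It could be called from a worksheet as =RemoveEmptyAnchors(A1), or from other VBA code in place of txtrpl.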

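You may be scraping this HTML from a web page using IE automation or some other means. In that case the early binding approach is useful, because you get IntelliSense for the HTMLDocument object, its methods and so on.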

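I appreciate that my comment (pointing to the classic SO answer about parsing HTML with regex) may have seemed flippant. However, parsing HTML with a regex is fraught with difficulty and is often just an exercise in futility.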

Hopefully this approach gives you another option, should you wish to go down this route.
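
For completeness, here is a hedged sketch of a narrower regex for the six sample links only. The pattern in the question over-matches because `.*` is greedy: it can run past the end of the opening <a ...> tag to a later `">`, for example the one that closes class="cars". Using `[^>]*` keeps each match inside a single tag. StripEmptyLinks is a hypothetical name. The pattern assumes that attribute values never contain `>` and that an "empty" link contains nothing but whitespace and <span> tags. Real-world markup can easily break both assumptions.

```vba
Option Explicit

' Hypothetical helper (not part of the original answer): removes <a> elements
' whose content is only whitespace and/or opening and closing <span> tags.
Public Function StripEmptyLinks(ByVal x As String) As String

    With CreateObject("VBScript.RegExp")
        .Global = True
        .IgnoreCase = True
        ' <a\b[^>]*>          opening <a ...> tag, confined to a single tag
        ' (?:\s|<\/?span\b[^>]*>)*   any mix of whitespace and <span ...> / </span>
        ' <\/a>               closing tag must follow immediately
        .Pattern = "<a\b[^>]*>(?:\s|<\/?span\b[^>]*>)*<\/a>"
        StripEmptyLinks = Trim$(.Replace(x, ""))
    End With

End Function
```

Links that contain an <img> or any text are left alone, because neither can be consumed by the middle group and the closing </a> then fails to match.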

RegEx pattern to match links with empty anchor text, excluding those with <img/>