VBA web scraper - returning InnerHTML using regular expressions

Time: 2018-05-25 11:57:13

Tags: regex vba

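Using Excel VBA, I have to scrape some data from this website.

Since the relevant website objects don't contain an id, I can't use HTML.Document.GetElementById.

However, I noticed that the relevant information is always stored in a <div> section like this: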

<div style="padding:7px 12px">Basler Versicherung AG &#214;zmen</div>

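Question: Is it possible to build a RegExp, possibly in a loop, that returns the content between <div style="padding:7px 12px"> and the next </div>?

What I have so far is the complete InnerHtml of the container; obviously I still need to add some code that loops over the matches of the RegExp, which has yet to be built.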

Private Function GetInnerHTML(url As String) As String
    Dim i As Long
    Dim Doc As Object
    Dim objElement As Object
    Dim objCollection As Object

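    On Error GoTo catch
    'Internet Explorer object (ie) is already assigned
    With ie
        .Navigate url
        'Wait until the page has finished loading, not just until IE stops being busy
        Do While .Busy Or .readyState <> 4
            DoEvents
        Loop
        GetInnerHTML = .document.getElementById("cphContent_sectionCoreProperties").innerHTML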
    End With
    Exit Function
catch:
    GetInnerHTML = Err.Number & " " & Err.Description
End Function

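2 Answers:

Answer 0: (score: 2)

I don't think you need regular expressions to find the content on the page. I think you can use the relative position of the elements to find the content you are after.

Code: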

Option Explicit

Public Sub GetContent()
    Dim URL     As String: URL = "https://www.uid.admin.ch/Detail.aspx?uid_id=CHE-105.805.649"
    Dim IE      As Object: Set IE = CreateObject("InternetExplorer.Application")
    Dim Labels  As Object
    Dim Label   As Variant
    Dim Values  As Variant: ReDim Values(0 To 1, 0 To 5000)
    Dim i       As Long

    With IE
        .Navigate URL
        .Visible = False

        'Load the page
        Do Until IE.busy = False And IE.readystate = 4
            DoEvents
        Loop

        'Find all labels in the table
        Set Labels = IE.document.getElementByID("cphContent_pnlDetails").getElementsByTagName("label")

        'Iterate the labels, then find the divs relative to these
        For Each Label In Labels
            Values(0, i) = Label.InnerText
            Values(1, i) = Label.NextSibling.Children(0).InnerText
            i = i + 1
        Next

    End With

    'Dump the values to Excel (skipped if no labels were found, since ReDim would fail for i = 0)
    If i > 0 Then
        ReDim Preserve Values(0 To 1, 0 To i - 1)
        ThisWorkbook.Sheets(1).Range("A1:B" & i).Value = WorksheetFunction.Transpose(Values)
    End If

    'Close IE
    IE.Quit
End Sub
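
If a regular expression is still preferred, as the question asks, the loop could look roughly like the sketch below. ExtractDivContents is a hypothetical helper name, not part of the original code. The pattern assumes the markup arrives exactly as shown in the question; Internet Explorer may rewrite attributes when it serializes innerHTML, so the pattern might need adjusting for what the browser actually returns.

```vba
Option Explicit

'Sketch only: collects the text between each <div style="padding:7px 12px">
'and the next </div> in an HTML fragment.
'ExtractDivContents is a hypothetical helper name.
Public Function ExtractDivContents(ByVal html As String) As Collection
    Dim re As Object
    Dim matches As Object
    Dim m As Object
    Dim results As New Collection

    'Late binding, so no reference to the VBScript Regular Expressions library is needed
    Set re = CreateObject("VBScript.RegExp")
    With re
        .Global = True       'return every match, not just the first one
        .IgnoreCase = True
        'Lazy quantifier stops at the first closing </div> after the opening tag
        .Pattern = "<div style=""padding:7px 12px"">([\s\S]*?)</div>"
    End With

    Set matches = re.Execute(html)
    For Each m In matches
        results.Add m.SubMatches(0)   'the captured group: content between the tags
    Next m

    Set ExtractDivContents = results
End Function
```

It could then be combined with the function from the question, for example:

```vba
Dim item As Variant
For Each item In ExtractDivContents(GetInnerHTML(url))
    Debug.Print item
Next item
```

Note that this returns the raw markup between the tags, so nested elements or HTML entities would come back unprocessed, which is one reason the DOM approach above is usually more robust.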

Answer 1: (score: 2)

Another way to achieve the same result is with an XMLHTTP request. Give it a shot:

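Sub Fetch_Data()
    Dim S As String, I As Long

    'Download the raw HTML (requires the Microsoft XML, V6.0 reference)
    With New XMLHTTP60
        .Open "GET", "https://www.uid.admin.ch/Detail.aspx?uid_id=CHE-105.805.649", False
        .send
        S = .responseText
    End With

    'Parse the response without a browser (requires the Microsoft HTML Object Library reference)
    With New HTMLDocument
        .body.innerHTML = S
        'Select the labels inside the container whose id starts with "cphContent_ct"
        With .querySelectorAll("#cphContent_sectionCoreProperties label[id^='cphContent_ct']")
            For I = 0 To .Length - 1
                'Column A: label text; column B: text of the first child of the label's next sibling
                Cells(I + 1, 1) = .Item(I).innerText
                Cells(I + 1, 2) = .Item(I).NextSibling.FirstChild.innerText
            Next I
        End With
    End With
End Sub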

Add references to the following libraries before executing the above script:

Microsoft HTML Object Library
Microsoft XML, V6.0
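
If adding references is not an option, a late-bound variant along the following lines might work instead. This is a sketch under assumptions: Fetch_Data_LateBound is a hypothetical name, and because late-bound "htmlfile" documents do not reliably expose querySelectorAll, it falls back to getElementById plus getElementsByTagName. That means it visits every label in the container rather than only those whose id starts with "cphContent_ct", so it assumes each label is followed by a sibling whose first child holds the value.

```vba
Option Explicit

'Sketch only: same idea as Fetch_Data, but created with CreateObject
'so that no library references are required.
Sub Fetch_Data_LateBound()
    Dim http As Object
    Dim doc As Object
    Dim labels As Object
    Dim i As Long

    'Synchronous GET request for the page
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", "https://www.uid.admin.ch/Detail.aspx?uid_id=CHE-105.805.649", False
    http.send

    'Load the response into an in-memory HTML document
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = http.responseText

    'All labels inside the container (no id filter, unlike the selector above)
    Set labels = doc.getElementById("cphContent_sectionCoreProperties").getElementsByTagName("label")

    For i = 0 To labels.Length - 1
        Cells(i + 1, 1) = labels.Item(i).innerText
        Cells(i + 1, 2) = labels.Item(i).NextSibling.FirstChild.innerText
    Next i
End Sub
```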