How can I use VBA to get script-generated data from an HTML page?

Time: 2018-02-22 10:12:40

Tags: javascript html vba parsing xmlhttprequest

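I am trying to parse a website using VBA. My goal is to extract company data (name, website, location, investors, investment dates, investment amounts, etc.) and store it in an Excel sheet. My routine for retrieving the HTML source is shown below (it works, at least on my computer). I use late binding for better portability.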

Private Sub getHTMLFromURL()

    ' URLStr is assumed to be declared and assigned elsewhere (e.g. at module level).
    Dim objHTTP As Object
    Dim html As Object

    ' Late binding: no library references are required.
    Set objHTTP = VBA.CreateObject("MSXML2.XMLHTTP")
    Set html = CreateObject("htmlfile")

    On Error GoTo endProgram

    With objHTTP
        .Open "GET", URLStr, False   ' False = synchronous request
        .send

        If .readyState = 4 And .Status = 200 Then
            ' Load the response text into an HTML document for parsing.
            html.Open
            html.write .responseText
            html.Close
        Else
            Debug.Print "Error" & vbNewLine & "Ready state: " & .readyState & _
            vbNewLine & "HTTP request status: " & .Status
            GoTo endProgram
        End If

    End With

endProgram:
    ' Release the objects, then report any run-time error that occurred.
    Set html = Nothing
    Set objHTTP = Nothing
    If Err.Number <> 0 Then
        Debug.Print "Error in getHTMLFromURL " & Err.Number & " - " & Err.Description
    End If

End Sub

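My problem is that the HTML is essentially one huge SCRIPT tag in which the data is hidden. For example, I think the company's headquarters location is in this line of code: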

{"key":"hqLocations","name":"Location","sortable":false,"getText":"function getText(c) {\n    return getHQCity(c.hqLocations);\n  }"}

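I humbly admit that I have no idea how to get at this data. I have searched many forums without finding a suitable answer. I tried to adapt this method, but without success. So I have several questions related to my problem: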

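  • Is MSXML2.XMLHTTP the best approach?
  • Do I have to capture each variable myself, or is there a way to interpret the script directly and obtain HTML that contains all the data (which would be easier)?
  • Otherwise, how can I extract all the data (both single values and arrays)?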

Thanks a lot.

1 answer:

Answer 0: (score: 0)

Try the following script. When you run it, you should get the data you asked for in your post:

Sub Fetch_Data()
    Dim IE As New InternetExplorer, HTML As HTMLDocument
    Dim posts As Object, post As Object, hdata As Object
    Dim elem As Object, trow As Object
    Dim R As Long, C As Long, I As Long   ' row and column counters for the sheet

    With IE
        .Visible = False
        .navigate "http://app.startupeuropeclub.eu/companies/pld_space"
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Set HTML = .document
    End With

    ' Wait until the script-generated fields have been loaded
    Do: Set hdata = HTML.getElementsByClassName("field"): DoEvents: Loop Until hdata.Length > 1

    ' Write each field's title (column 1) and description (column 2)
    For Each post In hdata
        With post.getElementsByClassName("title")
            If .Length Then R = R + 1: Cells(R, 1) = .Item(0).innerText
        End With
        With post.getElementsByClassName("description")
            If .Length Then Cells(R, 2) = .Item(0).innerText
        End With
    Next post

    ' Avoid a hardcoded delay: poll until the table exists
    Do: Set posts = HTML.querySelector(".card-content.with-padding table.simple-table"): DoEvents: Loop While posts Is Nothing

    ' Copy the table cell by cell, starting at row 10 of the sheet
    For Each elem In posts.Rows
        For Each trow In elem.Cells
            C = C + 1: Cells(I + 10, C) = trow.innerText
        Next trow
        C = 0
        I = I + 1
    Next elem
    IE.Quit
End Sub

References to add to the project (Tools > References in the VBA editor):

Microsoft Internet Controls
Microsoft HTML Object Library
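
If you would rather stay with MSXML2.XMLHTTP and late binding, one possible alternative is to search the raw response text yourself. The following is only a minimal sketch, not a tested solution for this site: GetJsonString and DemoGetJsonString are hypothetical names introduced here, the helper only handles simple "key":"value" string pairs, and it assumes the key contains no regular-expression metacharacters. Arrays and nested objects would need a real JSON parser.

```vba
' Hypothetical helper (not part of the original answer): returns the string
' value stored under jsonKey in a JSON-like text, or "" if it is not found.
Function GetJsonString(ByVal source As String, ByVal jsonKey As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")   ' late bound, no reference needed
    re.Global = False
    re.IgnoreCase = False
    ' Matches "key" : "value", allowing backslash-escaped characters in the value.
    re.Pattern = """" & jsonKey & """\s*:\s*""((?:[^""\\]|\\.)*)"""

    Set matches = re.Execute(source)
    If matches.Count > 0 Then GetJsonString = matches(0).SubMatches(0)
End Function

' Demonstration on a shortened copy of the line quoted in the question.
Sub DemoGetJsonString()
    Dim sample As String
    sample = "{""key"":""hqLocations"",""name"":""Location"",""sortable"":false}"
    Debug.Print GetJsonString(sample, "key")    ' prints hqLocations
    Debug.Print GetJsonString(sample, "name")   ' prints Location
End Sub
```

In the question's routine, the text to search would be .responseText. Whether the company data is actually present in that text, or is fetched later by the page's script, is something to check in the page source first.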