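I am trying to parse a website using VBA. My goal is to extract company data (name, website, location, investors, investment date, investment amount, etc.) and store it in an Excel sheet. My routine for fetching the HTML source is below (it works, at least on my computer...). I use late binding for better portability.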
Private Sub getHTMLFromURL()
    ' URLStr is not declared in this routine; it is assumed to be a
    ' module-level String that holds the target URL.
    Dim objHTTP As Object
    Dim html As Object
    Set objHTTP = VBA.CreateObject("MSXML2.XMLHTTP")
    Set html = CreateObject("htmlfile")
    On Error GoTo endProgram
    With objHTTP
        .Open "GET", URLStr, False   ' False = synchronous request
        .send
        If .readyState = 4 And .Status = 200 Then
            ' Load the response text into an HTML document object
            html.Open
            html.write .responseText
            html.Close
        Else
            Debug.Print "Error" & vbNewLine & "Ready state: " & .readyState & _
                        vbNewLine & "HTTP request status: " & .Status
            GoTo endProgram
        End If
    End With
endProgram:
    ' Release the objects, then report any runtime error
    Set html = Nothing
    Set objHTTP = Nothing
    If Err.Number <> 0 Then
        Debug.Print "Error in getHTMLFromURL " & Err.Number & " - " & Err.Description
    End If
End Sub
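My problem is that the HTML is essentially one huge SCRIPT tag in which the data is hidden. For example, I think the company's headquarters location is in this line of code: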
{"key":"hqLocations","name":"Location","sortable":false,"getText":"function getText(c) {\n return getHQCity(c.hqLocations);\n }"}
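I humbly admit that I have no idea how to get at this data. I have searched many forums without finding a suitable answer. I tried to adapt this method, without success. I therefore have a few questions related to my problem:
Thanks a lot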
Answer 0 (score: 0)
Try the following script. When you run it, you should get the data you asked for in your post:
Sub Fetch_Data()
    Dim IE As New InternetExplorer, HTML As HTMLDocument
    Dim posts As Object, post As Object, hdata As Object
    Dim elem As Object, trow As Object
    Dim R As Long, C As Long, I As Long   ' row/column counters for the active sheet
    With IE
        .Visible = False
        .navigate "http://app.startupeuropeclub.eu/companies/pld_space"
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Set HTML = .document
    End With
    ' Wait until the script-generated "field" elements have been loaded
    Do: Set hdata = HTML.getElementsByClassName("field"): DoEvents: Loop Until hdata.Length > 1
    ' Write each field's title to column 1 and its description to column 2
    For Each post In hdata
        With post.getElementsByClassName("title")
            If .Length Then R = R + 1: Cells(R, 1) = .Item(0).innerText
        End With
        With post.getElementsByClassName("description")
            If .Length Then Cells(R, 2) = .Item(0).innerText
        End With
    Next post
    ' Poll for the table instead of using a hardcoded delay
    Do: Set posts = HTML.querySelector(".card-content.with-padding table.simple-table"): DoEvents: Loop While posts Is Nothing
    ' Copy the table cell by cell, starting at row 10 of the sheet
    For Each elem In posts.Rows
        For Each trow In elem.Cells
            C = C + 1: Cells(I + 10, C) = trow.innerText
        Next trow
        C = 0
        I = I + 1
    Next elem
    IE.Quit
End Sub
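A different route, if you would rather stay with the MSXML2.XMLHTTP request from the question: treat .responseText as plain text and pull values out of the embedded script with a regular expression. What follows is only a rough sketch. ExtractJsonString is a hypothetical helper, it assumes the value is a simple quoted string with no escaped quotes inside, it assumes the key contains no regex metacharacters, and the real shape of the page's data (for example what hqLocations actually holds) is not known here.

```vba
' Hypothetical helper: returns the first string value stored under jsonKey
' in source, or an empty string if the key is not found.
' Assumes a plain "key":"value" pair with no escaped quotes in the value.
Function ExtractJsonString(ByVal source As String, ByVal jsonKey As String) As String
    Dim re As Object, matches As Object
    Set re = CreateObject("VBScript.RegExp")   ' late bound, no reference needed
    ' Builds the pattern  "jsonKey"\s*:\s*"([^"]*)"
    re.Pattern = """" & jsonKey & """\s*:\s*""([^""]*)"""
    re.Global = False
    Set matches = re.Execute(source)
    If matches.Count > 0 Then ExtractJsonString = matches(0).SubMatches(0)
End Function
```

Applied to the column definition quoted in the question, ExtractJsonString(text, "name") would return Location. For nested objects or arrays a real JSON parser would be a safer choice than a regular expression.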
Add these references to the VBA project (Tools > References) before running Fetch_Data:
Microsoft Internet Controls
Microsoft HTML Object Library
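Since the question mentions a preference for late binding, the browser can also be created with CreateObject, in which case neither reference is needed. A minimal sketch, assuming Internet Explorer is installed on the machine; Fetch_Data_LateBound is a made-up name, and only the navigation and wait steps are shown:

```vba
Sub Fetch_Data_LateBound()
    ' Late-bound objects: no project references required
    Dim IE As Object, doc As Object
    Set IE = CreateObject("InternetExplorer.Application")
    With IE
        .Visible = False
        .navigate "http://app.startupeuropeclub.eu/companies/pld_space"
        ' Wait for the initial page load (4 = READYSTATE_COMPLETE)
        While .Busy Or .readyState < 4: DoEvents: Wend
        Set doc = .document
    End With
    Debug.Print doc.Title   ' quick check that the document was obtained
    IE.Quit
End Sub
```

With late binding the variables are plain Object, so there is no IntelliSense and typos in member names only show up at run time.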