Scraping JavaScript-generated web content into Excel

Time: 2015-10-04 23:42:29

Tags: excel vba

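I have been scraping some web content with VBA and MSXML, so I know the basics. Now I want to get data from a page that is generated by JavaScript. I can't give you the exact link because it is private, but I can describe it: basically, there are div containers with a title and some images, and below them there are tables that are loaded dynamically (a spinner is shown) but never updated afterwards (so they load only once). If you open the source view in the browser, you cannot find these tables, only the containers and the titles / image src attributes. But if you click on a table and choose "Inspect element", you can see the typical <th>, <tr>, <td> structure.

The approaches I know of: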

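1) Save the page and then scrape the saved copy - probably not the best solution.

If I have a list of URLs, is there a quick way to save all of the pages?

2) Drive the Internet Explorer control from VBA, wait until the page has loaded, and then get the elements as usual - but this seems very slow to me (?) - something like 25 seconds per page, even though the page itself loads in 0.5 seconds.

Maybe I should switch off something that slows the loading down? Could you check what the problem is?

Here is the code I found: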

Sub FuturesScrap3(ByVal URL As String)

Dim HTMLDoc As New HTMLDocument
Dim AnchorLinks As Object
Dim tdElements As Object
Dim tdElement As Object
Dim AnchorLink As Object
Dim lRow As Long
Dim oElement As Object

Dim oIE As InternetExplorer

Set oIE = New InternetExplorer

oIE.navigate URL
oIE.Visible = True

Do Until (oIE.readyState = 4 And Not oIE.Busy)
    DoEvents
Loop

'Wait for Javascript to run
Application.Wait (Now + TimeValue("0:01:00"))

HTMLDoc.body.innerHTML = oIE.document.body.innerHTML

With HTMLDoc.body
    Set AnchorLinks = .getElementsByTagName("a")
    Set tdElements = .getElementsByTagName("td")

    For Each AnchorLink In AnchorLinks
        Debug.Print AnchorLink.innerText
    Next AnchorLink

End With

lRow = 1
For Each tdElement In tdElements
    Debug.Print tdElement.innerText
    Cells(lRow, 1).Value = tdElement.innerText
    lRow = lRow + 1
Next

'Clicking the Month tab
For Each oElement In oIE.document.all
    If Trim(oElement.innerText) = "Month" Then
        oElement.Focus
        oElement.Click
    End If
Next oElement

Do Until (oIE.readyState = 4 And Not oIE.Busy)
    DoEvents
Loop

'Wait for Javascript to run
Application.Wait (Now + TimeValue("0:01:00"))

HTMLDoc.body.innerHTML = oIE.document.body.innerHTML

With HTMLDoc.body
    Set AnchorLinks = .getElementsByTagName("a")
    Set tdElements = .getElementsByTagName("td")

    For Each AnchorLink In AnchorLinks
        Debug.Print AnchorLink.innerText
    Next AnchorLink
End With

lRow = 1
For Each tdElement In tdElements
    Debug.Print tdElement.innerText
    Cells(lRow, 2).Value = tdElement.innerText
    lRow = lRow + 1
Next tdElement

End Sub

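3) Use a web driver such as Selenium - I couldn't find a suitable example. It would be great if you could give me one from scratch, for example getting data by class name.

4) I'm not sure about this one, but it is possibly the fastest - get the data directly from the JS variables/arrays that are used to build these tables. I have heard that you can access JS from VBA, but I haven't found any proper example of getting the data that way.

All solutions should stay within VBA. I would like to know which approach is the fastest.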

1 Answer:

Answer 0 (score: 0)

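Thanks for the comments. @Marc, no, it is not possible to import the data with a web query / Power Query "From Web"; only the headings come through.

I got some code working - it had a 1-minute (!) delay in it (whoever wrote it probably made a mistake when adding the wait for the page's lazy-loading script).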

Sub FuturesScrap3(ByVal URL As String)

Dim HTMLDoc As New HTMLDocument
Dim AnchorLinks As Object
Dim tdElements As Object
Dim tdElement As Object
Dim AnchorLink As Object
Dim lRow As Long
Dim oElement As Object

Dim oIE As InternetExplorer

Set oIE = New InternetExplorer

oIE.navigate URL
oIE.Visible = True

Do Until (oIE.readyState = 4 And Not oIE.Busy)
    DoEvents
Loop

'Wait for Javascript to run - 1 second is enough in my case
Application.Wait (Now + TimeValue("0:00:01"))

HTMLDoc.body.innerHTML = oIE.document.body.innerHTML

With HTMLDoc.body
    Set AnchorLinks = .getElementsByTagName("a")
    Set tdElements = .getElementsByTagName("td")

    For Each AnchorLink In AnchorLinks
        Debug.Print AnchorLink.innerText
    Next AnchorLink

End With

lRow = 1
For Each tdElement In tdElements
    Debug.Print tdElement.innerText
    Cells(lRow, 1).Value = tdElement.innerText
    lRow = lRow + 1
Next

'Clicking the Month tab
For Each oElement In oIE.document.all
    If Trim(oElement.innerText) = "Month" Then
        oElement.Focus
        oElement.Click
    End If
Next oElement

Do Until (oIE.readyState = 4 And Not oIE.Busy)
    DoEvents
Loop


HTMLDoc.body.innerHTML = oIE.document.body.innerHTML

With HTMLDoc.body
    Set AnchorLinks = .getElementsByTagName("a")
    Set tdElements = .getElementsByTagName("td")

    For Each AnchorLink In AnchorLinks
        Debug.Print AnchorLink.innerText
    Next AnchorLink
End With

lRow = 1
For Each tdElement In tdElements
    Debug.Print tdElement.innerText
    Cells(lRow, 2).Value = tdElement.innerText
    lRow = lRow + 1
Next tdElement

End Sub
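
A fixed Application.Wait is either too long or, on a slow day, too short. One possible alternative is to poll the live document until the table cells actually exist, with a timeout as a safety net. The following is only a sketch under assumptions: WaitForCells and its timeoutSeconds parameter are hypothetical names that are not part of the code above, and it assumes that the appearance of at least one <td> means the tables have finished rendering, which may not hold for every page.

```vba
' Hypothetical helper (not part of the original code): polls the IE document
' until at least one <td> element exists or the timeout expires.
' Returns True if cells were found, False if the timeout was reached.
Function WaitForCells(ByVal oIE As Object, ByVal timeoutSeconds As Double) As Boolean
    Dim startTime As Double
    Dim elapsed As Double

    startTime = Timer
    Do
        DoEvents
        ' Only touch the document once the browser reports it is ready
        If oIE.readyState = 4 And Not oIE.Busy Then
            If oIE.document.getElementsByTagName("td").Length > 0 Then
                WaitForCells = True
                Exit Function
            End If
        End If
        elapsed = Timer - startTime
        ' Timer restarts from 0 at midnight, so correct a negative difference
        If elapsed < 0 Then elapsed = elapsed + 86400
    Loop While elapsed < timeoutSeconds
End Function

' Possible use in place of the fixed wait (10 seconds is an arbitrary limit):
'     If WaitForCells(oIE, 10) Then
'         Set tdElements = oIE.document.getElementsByTagName("td")
'     End If
```

Note that after clicking the "Month" tab the old cells may still be in the document, so a check for "any <td> exists" would return immediately there. In that case the condition would need to test for something specific to the new table.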