Unable to get data in a tabular format

Asked: 2018-06-09 15:08:35

Tags: vba excel-vba internet-explorer web-scraping excel

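I have written a script in VBA, using IE, to get data from a web page. The data is not stored in any table; I mean there are no table, tr or td tags. However, it looks as if it were in a tabular format. For clarity, you can see the image below.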

What I have tried so far gets the data in a single column, one value per line, like:

$4,085  
$1,620
$1,435  
$35
$1,125  
$905

How I would like to get it is:

$4,085  $1,620
$1,435  $35
$1,125  $905

In other languages a list comprehension would let me handle this in one line of code, but in VBA I am stuck.

The HTML elements in which the data sits (this is just a chunk of the whole):

<ul id="tco_detail_data">
    <li>
        <ul class="list-title">
            <li class="first">&nbsp;</li>
            <li>Year 1</li>
            <li>Year 2</li>
            <li>Year 3</li>
            <li>Year 4</li>
            <li>Year 5</li>
            <li class="last">5 Yr Total</li>
        </ul>
    </li>
    <hr class="loose-dotted">


    <li class="first">
        <ul class="first">
            <li class="first">Depreciation</li>
                        <li>$4,085</li>
                        <li>$1,620</li>
                        <li>$1,425</li>
                        <li>$1,263</li>
                        <li>$1,133</li>
                    <li class="last">$9,526</li>
        </ul>
    </li>
</ul>

How the data is displayed on that page:

[Image: the data as shown on the page, laid out like a table]

This is what I have tried so far:

Sub Get_Information()
    Dim IE As New InternetExplorer, HTML As HTMLDocument
    Dim post As Object

    With IE
        .Visible = False
        .Navigate "https://www.edmunds.com/ford/escape/2017/cost-to-own/?zip=43215"
        While .Busy = True Or .ReadyState < 4: DoEvents: Wend
        Set HTML = .Document
    End With

    Application.Wait Now + TimeValue("00:00:05") 'waiting for the items to be available

    For Each post In HTML.getElementById("tco_detail_data").getElementsByTagName("li")
        Debug.Print post.innerText
    Next post
    IE.Quit
End Sub

References to add to the library to run the above script:

Microsoft Internet Controls
Microsoft HTML Object Library

2 answers:

Answer 0: (score: 3)

This can be done with a CSS selector. Updated to remove the explicit wait.

The selector is:

#tco_detail_data > li

That is, the li elements that are direct children of the element whose id is tco_detail_data.
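To see what the child combinator (>) changes, here is a minimal sketch that counts matches for two selectors against a small in-memory document. It assumes the Microsoft HTML Object Library reference is set; CountMatches, CompareSelectors and the sample markup are hypothetical names made up for illustration, not part of the page or the answer.

```vba
Option Explicit

' Hypothetical helper: loads markup into an in-memory HTMLDocument
' and returns how many elements match the given CSS selector.
' Assumes a reference to "Microsoft HTML Object Library".
Public Function CountMatches(ByVal markup As String, ByVal selector As String) As Long
    Dim doc As New HTMLDocument
    doc.body.innerHTML = markup
    CountMatches = doc.querySelectorAll(selector).Length
End Function

Public Sub CompareSelectors()
    ' Cut-down sample modelled on the question's markup (an assumption, not the live page)
    Const SAMPLE As String = _
        "<ul id=""tco_detail_data"">" & _
        "<li><ul><li>Depreciation</li><li>$4,085</li></ul></li>" & _
        "</ul>"

    ' "#id > li" matches only li elements directly under the container
    Debug.Print "direct children:", CountMatches(SAMPLE, "#tco_detail_data > li")
    ' "#id li" also matches the li elements nested inside each row
    Debug.Print "all descendants:", CountMatches(SAMPLE, "#tco_detail_data li")
End Sub
```

Restricting the match to direct children is what yields one node per row rather than one node per cell.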

Sample results from running the CSS query against the web page:

[Image: CSS query]

Code:

Option Explicit
Public Sub Get_Information()
    Dim IE As New InternetExplorer

    With IE
        .Visible = False
        .navigate "https://www.edmunds.com/ford/escape/2017/cost-to-own/?zip=43215"
        While .Busy = True Or .readyState < 4: DoEvents: Wend
    End With
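    ' Poll for the container for up to 5 seconds instead of using a fixed wait
    Dim a As Object, exitTime As Date
    exitTime = Now + TimeSerial(0, 0, 5)

    Do
        DoEvents
        On Error Resume Next
        ' querySelector returns Nothing while the element is absent
        Set a = IE.document.querySelector("#tco_detail_data")
        On Error GoTo 0
        If Now > exitTime Then Exit Do
    Loop While a Is Nothing

    If a Is Nothing Then
        IE.Quit ' do not leave a hidden IE instance running
        Exit Sub
    End If

    Dim resultsNodeList As Object, i As Long, arr() As String
    ' One node per row: li elements directly under the container
    Set resultsNodeList = IE.document.querySelectorAll("#tco_detail_data > li")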

    With ActiveSheet
        For i = 0 To 9
            arr = Split(resultsNodeList(i).innerText, Chr$(10))
            .Cells(i + 1, 1).Resize(1, UBound(arr) + 1).Value = arr
        Next
    End With

    IE.Quit
End Sub

Result in the sheet:

[Image: Result]

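Additional information:

The array part is needed because resultsNodeList(i).innerText returns a "stacked string", i.e. one with line breaks between the values; see the image below. I split on these line breaks to produce an array, which I then write out. The array is 0-based, so I have to add 1 to its upper bound to size the range correctly.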

[Image: unsplit strings]
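The split-and-resize step described above can be tried without a browser. This is a minimal sketch, assuming the values are separated by line feeds (Chr$(10), the same as vbLf); SplitStacked and DemoSplitStacked are hypothetical names, and the sample string is built from values in the question's markup.

```vba
Option Explicit

' Hypothetical helper: turns a "stacked" string (values separated by
' line feeds) into a 0-based array of strings.
Public Function SplitStacked(ByVal stacked As String) As String()
    SplitStacked = Split(stacked, vbLf)
End Function

Public Sub DemoSplitStacked()
    Dim parts() As String
    parts = SplitStacked("Depreciation" & vbLf & "$4,085" & vbLf & "$1,620")

    ' Split returns a 0-based array, so the element count is UBound + 1
    Debug.Print "elements:", UBound(parts) + 1

    ' Write the values across one row; the width comes from the array itself
    ActiveSheet.Cells(1, 1).Resize(1, UBound(parts) + 1).Value = parts
End Sub
```

Because the range width is taken from the array, rows with a different number of values need no special handling.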

Answer 1: (score: 2)

Besides what QHarr has already shown, there is another way to achieve the same goal:

Sub Get_Information()
    Dim IE As New InternetExplorer, HTML As HTMLDocument
    Dim posts As Object, post As Object, oitem As Object
    Dim R&, C&, B As Boolean

    With IE
        .Visible = False
        .Navigate "https://www.edmunds.com/ford/escape/2017/cost-to-own/?zip=43215"
        Do While .Busy = True Or .ReadyState <> 4: DoEvents: Loop
        Set HTML = .Document
    End With

    ''no hardcoded delay is required. The following line should take care of that

    Do: Set oitem = HTML.getElementById("tco_detail_data"): DoEvents: Loop While oitem Is Nothing

    For Each posts In oitem.getElementsByTagName("li")
        C = 1: B = False

        For Each post In posts.getElementsByTagName("li")
            Cells(R + 1, C).Value = post.innerText
            C = C + 1: B = True
        Next post

        If B Then R = R + 1
    Next posts
    IE.Quit
End Sub
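The wait loop in this answer keeps polling until the element appears, so it never returns if the page fails to render it. A bounded variant could look like the sketch below; WaitForElementById and its timeoutSeconds parameter are hypothetical names introduced here, not part of the answer.

```vba
Option Explicit

' Hypothetical helper: polls the document for an element with the given id
' and returns it, or returns Nothing once timeoutSeconds have elapsed.
Public Function WaitForElementById(ByVal doc As Object, ByVal id As String, _
                                   Optional ByVal timeoutSeconds As Long = 10) As Object
    Dim deadline As Date
    deadline = DateAdd("s", timeoutSeconds, Now)

    Do
        DoEvents ' keep Excel responsive while waiting
        Set WaitForElementById = doc.getElementById(id)
        If Not WaitForElementById Is Nothing Then Exit Function
    Loop While Now < deadline
End Function
```

In the routine above it would stand in for the one-line Do loop, for example Set oitem = WaitForElementById(HTML, "tco_detail_data"), followed by a check for Nothing before iterating.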