How do I correctly reference the table on a website I'm scraping with VBA?

Time: 2018-10-14 08:29:36

Tags: html excel vba web-scraping

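I am building my first VBA code to scrape data from a website. I can open the site and click through the buttons to get the right data on screen, but I am struggling to reference the correct table to loop over. I want to access the embedded table "activity". To do this, I took an earlier answer on how to loop through a table and extract information, and embedded it in my code. Below are the three places where errors occur.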

Are these errors linked (especially queries B and C)? Does anyone have any ideas?

Many thanks!

------Solution code (from QHarr's answer below)-------------

Note: this requires references (VBE > Tools > References, then add references to): Microsoft Internet Controls, Microsoft HTML Object Library

Public Sub GetTable()
    Dim IE As InternetExplorer, ele As Object, clipboard As Object, hTable As htmlTable, t As Date, ws As Worksheet
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Const MAX_WAIT_SEC As Long = 20
    Set IE = New InternetExplorer
    With IE
        .Visible = True
        .navigate "https://na3.docusign.net/Member/EmailStart.aspx?a=59595fcb-34be-4375-b880-a0be581d0f37&r=f6d28b49-e66d-4fa4-a7e9-69c2c741fde5"
        While .Busy Or .readyState < 4: DoEvents: Wend
        t = Timer
        Do
            DoEvents
            On Error Resume Next
            Set ele = .Document.querySelector("[data-qa='show-history']")
            'On Error GoTo 0 'I removed this line as it was throwing an error as soon as the 'Show-history' element loaded.
            If Timer - t > MAX_WAIT_SEC Then Exit Do
        Loop While ele Is Nothing

        If ele Is Nothing Then Exit Sub

        ele.Click

        While .Busy Or .readyState < 4: DoEvents: Wend

        Set hTable = .Document.querySelector("[data-qa='history-dialog-audit-logs']")

        ''**********************************************************************
        '' Loop table and write out method. This method uses the sub WriteTable
        Application.ScreenUpdating = False  
        WriteTable hTable, 1, ws
        Application.ScreenUpdating = True
        ''**********************************************************************
        .Quit
    End With
End Sub

Public Sub WriteTable(ByVal hTable As htmlTable, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
    If ws Is Nothing Then Set ws = ActiveSheet
    Dim tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, c As Long
    r = startRow
    With ws
        Set tRow = hTable.getElementsByTagName("tr")
        For Each tr In tRow
            Set tCell = tr.getElementsByTagName("td")
            c = 1
            For Each td In tCell
                .Cells(r, c).Value = td.innerText
                c = c + 1
            Next td
            r = r + 1
        Next tr
    End With
End Sub

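------Original query code-------

Query A: I get an "Object required" error while the page is loading. If I continue running the script the error goes away, so I assume the problem is how I handle load time? It occurs after the "Loop" code completes: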

    With objIE
        .Visible = True
        .navigate WebSite
        Do While .Busy Or .readyState <> 4
            DoEvents
        Loop

        .document.querySelector("[data-qa='show-history']").Click

Query B: I get another "Object required" error on this line, which I can also continue past:

For Each ele In objIE.document.getElementById("activity").getElementsByTagName("tr")

Query C: I get a "Subscript out of range" error on the next line, and I cannot continue past it:

Sheets("Sheet1").Range("A" & y).Value = ele.Children(0).textContent

Full code:

Sub googlesearch3()
    Set objIE = CreateObject("InternetExplorer.Application")
    WebSite = "websiteurl"

    With objIE
        .Visible = True
        .navigate WebSite
        Do While .Busy Or .readyState <> 4
            DoEvents
        Loop

        .document.querySelector("[data-qa='show-history']").Click
End With
'within the 'history-dialog-audit-logs' table, loop and extract data


    'we will output data to excel, starting on row 1
    y = 1

    'look at all the 'tr' elements in the 'table' with id 'myTable',
    'and evaluate each, one at a time, using 'ele' variable
    For Each ele In objIE.document.getElementById("activity").getElementsByTagName("tr")
        'show the text content of 'tr' element being looked at
        Debug.Print ele.textContent
        'each 'tr' (table row) element contains 4 children ('td') elements
        'put text of 1st 'td' in col A
        Sheets("Sheet1").Range("A" & y).Value = ele.Children(0).textContent
        'put text of 2nd 'td' in col B
        Sheets("Sheet1").Range("B" & y).Value = ele.Children(1).textContent
        'put text of 3rd 'td' in col C
        Sheets("Sheet1").Range("C" & y).Value = ele.Children(2).textContent
        'put text of 4th 'td' in col D
        Sheets("Sheet1").Range("D" & y).Value = ele.Children(3).textContent
        'increment row counter by 1
        y = y + 1
    'repeat until last ele has been evaluated
    Next

'check if word 'completed' is mentioned anywhere, if so update 'Status' to 'Completed' and search for text.

'Find "signed the envelope" and show all text before this until you find <td>. Stop after one occurrence
'store text in 'LastSigned' string

'find "sent an invitation to" and show all text before this until you find <td>. Stop after one occurrence
'store text in 'CurrentlyWith' string


 Set IE = Nothing

End Sub

Other: I also tried another answer, but the Dim statements did not work...

1 answer:

Answer 0 (score: 1)

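Here are two ways to write out the table. One uses the clipboard; the other loops over the rows and over the table cells within each row (that version is commented out, 3 lines). I use a timed loop of MAX_WAIT_SEC seconds to give the clickable element time to appear, as an attempt to address your query A. For me, there is not enough HTML to give a good explanation for queries B and C. They may all come down to timing issues in the first place.

Note: generally, after a .Click you want another While .Busy Or .readyState < 4: DoEvents: Wend, and possibly another Do Loop, to allow the page content to update.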
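The second timed loop mentioned in the note could look roughly like the sketch below. WaitForElement is a hypothetical helper name, not part of the original answer, and the 5-second default is an assumption. It repeats the polling pattern used for the show-history button and does not handle Timer rolling over at midnight.

```vba
' Hypothetical helper (an assumption, not from the original answer):
' polls the document until the CSS selector matches an element
' or timeoutSec seconds have elapsed.
Public Function WaitForElement(ByVal doc As Object, ByVal selector As String, _
                               Optional ByVal timeoutSec As Single = 5) As Object
    Dim started As Single, found As Object
    started = Timer
    Do
        DoEvents
        On Error Resume Next            ' the document may still be updating
        Set found = doc.querySelector(selector)
        On Error GoTo 0
        If Timer - started > timeoutSec Then Exit Do
    Loop While found Is Nothing
    Set WaitForElement = found          ' Nothing if the wait timed out
End Function

' Possible use inside the With IE block, after ele.Click:
'     Set hTable = WaitForElement(.Document, "#activity .dstable", MAX_WAIT_SEC)
'     If hTable Is Nothing Then Exit Sub
```

Checking for Nothing before using the result avoids an "Object required" error when the table has not appeared within the timeout.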

Option Explicit
Public Sub GetTable()
    Dim IE As InternetExplorer, ele As Object, clipboard As Object, hTable As HTMLTable, t As Date, ws As Worksheet
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Const MAX_WAIT_SEC As Long = 5
    Set IE = New InternetExplorer
    With IE
        .Visible = True
        .navigate "yourURL"
        While .Busy Or .readyState < 4: DoEvents: Wend
        t = Timer
        Do
            DoEvents
            On Error Resume Next
            Set ele = .Document.querySelector("[data-qa='show-history']")
            On Error GoTo 0
            If Timer - t > MAX_WAIT_SEC Then Exit Do
        Loop While ele Is Nothing

        If ele Is Nothing Then Exit Sub

        ele.Click

        While .Busy Or .readyState < 4: DoEvents: Wend

        Set hTable = .Document.querySelector("#activity .dstable")

        ''*********************************************************************
        ''Copy table to clipboard and paste  method
        Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
        clipboard.SetText hTable.outerHTML  ' SetText is a method, so no "=" here
        clipboard.PutInClipboard
        ws.Cells(1, 1).PasteSpecial
        ''**********************************************************************

        ''**********************************************************************
        '' Loop table and write out method. This method uses the sub WriteTable
        ' Application.ScreenUpdating = False  '<==Uncomment these 3 lines and comment out lines above if using this method.
        ' WriteTable hTable, 1, ws
        ' Application.ScreenUpdating = True
        ''**********************************************************************
        .Quit
    End With
End Sub

Public Sub WriteTable(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
    If ws Is Nothing Then Set ws = ActiveSheet
    Dim tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, c As Long
    r = startRow
    With ws
        Set tRow = hTable.getElementsByTagName("tr")
        For Each tr In tRow
            Set tCell = tr.getElementsByTagName("td")
            c = 1
            For Each td In tCell
                .Cells(r, c).Value = td.innerText
                c = c + 1
            Next td
            r = r + 1
        Next tr
    End With
End Sub
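WriteTable collects only td elements, so any header cells marked up as th are skipped. If the target table has such cells (an assumption, since the page's HTML is not shown), a variant along these lines might help. The name WriteTableWithHeaders is hypothetical. It uses the table's rows collection and each row's cells collection, which include both th and td.

```vba
' Hypothetical variant of WriteTable (the name is an assumption):
' walks the table's own rows and each row's cells (th and td alike).
Public Sub WriteTableWithHeaders(ByVal hTable As Object, _
                                 Optional ByVal startRow As Long = 1, _
                                 Optional ByVal ws As Worksheet)
    Dim tr As Object, cell As Object, r As Long, c As Long
    If ws Is Nothing Then Set ws = ActiveSheet
    r = startRow
    For Each tr In hTable.Rows          ' rows belonging to this table
        c = 1
        For Each cell In tr.Cells       ' header and data cells in order
            ws.Cells(r, c).Value = cell.innerText
            c = c + 1
        Next cell
        r = r + 1
    Next tr
End Sub
```

It is called the same way as WriteTable, for example WriteTableWithHeaders hTable, 1, ws.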

References (VBE > Tools > References, then add references to):

  1. Microsoft Internet Controls
  2. Microsoft HTML Object Library

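Edit: in some cases the late-bound clipboard reference now seems to cause problems. Here is the generic early-bound approach, where hTable is the target HTMLTable object.

For early binding of the clipboard, go to VBE > Tools > References > Microsoft Forms 2.0 Object Library.

If you add a UserForm to your project, the library is added automatically.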

Dim clipboard As DataObject
Set clipboard = New DataObject
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).PasteSpecial