How do I scrape data from a table across all pages?

Asked: 2019-05-14 09:38:20

Tags: excel vba web-scraping pagination

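I am extracting data from a website, but my code only extracts the first two pages.

I tried adding a For loop, but it fails to navigate to the other pages.

Here is the HTML: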

<div class="dataTables_length" id="activitylog_table_length"><label>Show <select name="activitylog_table_length" aria-controls="activitylog_table" class="custom-select custom-select-sm form-control form-control-sm">
<option value="10">10</option>
<option value="25">25</option>
<option value="50">50</option>
<option value="100">100</option>
<option value="200">200</option>
<option value="500">500</option></select> entries</label></div>



<div class="dataTables_info" id="activitylog_table_info" role="status" aria-live="polite">Showing 1 to 10 of 668 entries</div>

<div class="col-sm-12 col-md-7"><div class="dataTables_paginate paging_full_numbers" id="activitylog_table_paginate"><ul class="pagination">
<li class="paginate_button page-item first disabled" id="activitylog_table_first">
<a href="#" aria-controls="activitylog_table" data-dt-idx="0" tabindex="0" class="page-link">
<i class="la la-angle-double-left"></i></a></li><li class="paginate_button page-item previous disabled" id="activitylog_table_previous">
<a href="#" aria-controls="activitylog_table" data-dt-idx="1" tabindex="0" class="page-link">
<i class="la la-angle-left"></i>
</a>
</li><li class="paginate_button page-item active"><a href="#" aria-controls="activitylog_table" data-dt-idx="2" tabindex="0" class="page-link">1</a>
</li><li class="paginate_button page-item "><a href="#" aria-controls="activitylog_table" data-dt-idx="3" tabindex="0" class="page-link">2</a>
</li><li class="paginate_button page-item "><a href="#" aria-controls="activitylog_table" data-dt-idx="4" tabindex="0" class="page-link">3</a>
</li><li class="paginate_button page-item "><a href="#" aria-controls="activitylog_table" data-dt-idx="5" tabindex="0" class="page-link">4</a>
</li><li class="paginate_button page-item "><a href="#" aria-controls="activitylog_table" data-dt-idx="6" tabindex="0" class="page-link">5</a>
</li><li class="paginate_button page-item disabled" id="activitylog_table_ellipsis"><a href="#" aria-controls="activitylog_table" data-dt-idx="7" tabindex="0" class="page-link">…</a>
</li><li class="paginate_button page-item "><a href="#" aria-controls="activitylog_table" data-dt-idx="8" tabindex="0" class="page-link">67</a>
</li><li class="paginate_button page-item next" id="activitylog_table_next">
<a href="#" aria-controls="activitylog_table" data-dt-idx="9" tabindex="0" class="page-link">
<i class="la la-angle-right"></i>
</a>
</li><li class="paginate_button page-item last" id="activitylog_table_last"><a href="#" aria-controls="activitylog_table" data-dt-idx="10" tabindex="0" class="page-link"><i class="la la-angle-double-right"></i></a></li></ul></div></div>

Sub Extract()

Dim ie As Object
Dim btn As Object
Dim temp As Object
Dim Table As Object
Dim tRows As Object
Dim rNum As Integer
Dim cNum As Integer
Dim tCells As Object
Dim np As Variant
Dim numPages As String
Dim url As String
Dim pages As MSHTML.IHTMLElementCollection
Dim i As Integer
Dim NextHref As String
Dim NextURL As String

url = "https://admin.timesheetmobile.com/mr2/new/activity.php"

Set ie = CreateObject("InternetExplorer.Application")

ie.Visible = False

' Navigate to the webpage
ie.navigate url

 ' Wait while the page is loading
 While ie.Busy
      DoEvents
 Wend
 Application.Wait DateAdd("s", 3, Now)
 ' Wait an additional 3 seconds for good measure


Set temp = ie.document.getElementsByClassName("dataTables_info")

numPages = temp(0).innerText

pos = Mid(numPages, 20, 3)
np = Round(pos, 0)

 rNum = 1
 cNum = 1

  Set Table = ie.document.getElementsByClassName("dataTables_scrollBody")

    Set tRows = Table(0).getElementsByTagName("tr")

    Set tHead = Table(0).getElementsByTagName("th")

    For Each h In tHead
        Sheet6.Cells(rNum, cNum).Value = h.innerText
        cNum = cNum + 1
    Next

    rNum = rNum + 1
    cNum = 1

For i = 1 To np

    For Each r In tRows

        Set tCells = r.getElementsByTagName("td")

        For Each c In tCells

            Sheet6.Cells(rNum, cNum).Value = c.innerText

            cNum = cNum + 1
        Next

        rNum = rNum + 1
        cNum = 1

    Next


    Set btn = ie.document.getElementsByClassName("paginate_button page-item next")
    btn(0).Click



Next

 ' Clear the ie object. This probably isn't necessary, but helps
 ' clean things up
Set ie = Nothing

End Sub

I want it to extract all the data from page 1 through page np. Is that possible? Or is there another way to do this?

1 Answer:

Answer 0: (score: 0)

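This is somewhat pseudocode, but it outlines extracting the number of pages and then clicking the next button until every page has been visited. I use an id selector for the next button because it is faster and more reliable than a compound class selector.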
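The page count can also be derived from the info text shown in the question ("Showing 1 to 10 of 668 entries"). The following is a minimal sketch, assuming the text always has that seven-word shape and that it is read while page 1 is displayed; `PageCountFromInfo` is a hypothetical helper name, not part of the original answer.

```vba
Option Explicit

' Hypothetical helper: derives the number of pages from DataTables info text
' such as "Showing 1 to 10 of 668 entries". Returns 0 if the text does not
' have the expected shape. Assumes the text was read on page 1, so that
' (last - first + 1) equals the page size.
Public Function PageCountFromInfo(ByVal infoText As String) As Long
    Dim parts() As String
    Dim firstRow As Long, lastRow As Long, total As Long, pageSize As Long

    parts = Split(Trim$(infoText), " ")
    ' Expected tokens: Showing, 1, to, 10, of, 668, entries
    If UBound(parts) < 5 Then Exit Function
    If Not (IsNumeric(parts(1)) And IsNumeric(parts(3)) And IsNumeric(parts(5))) Then Exit Function

    firstRow = CLng(parts(1))
    lastRow = CLng(parts(3))
    total = CLng(parts(5))

    pageSize = lastRow - firstRow + 1
    If pageSize <= 0 Then Exit Function

    ' Ceiling division using integer arithmetic
    PageCountFromInfo = (total + pageSize - 1) \ pageSize
End Function
```

For the info text in the question this gives 67, which matches the last numbered link in the pagination HTML. Note that the `Mid(numPages, 20, 3)` call in the question returns the entry count (668), not the page count.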

I work from ie.document afresh on each access, so that stale element exceptions do not bubble up disguised as errors within the page loop.
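One way to confirm that the next page has actually rendered before reading it is to poll the info text until it changes. This is only a sketch: it assumes the table redraws in place without a full navigation (so `ie.Busy` may never change) and that the text of `#activitylog_table_info` differs on every page. `WaitForInfoChange` is a hypothetical helper name.

```vba
Option Explicit

' Hypothetical helper: polls the DataTables info element until its text
' differs from oldText, or until timeoutSecs have elapsed.
' Returns True if the text changed, False on timeout.
' ie is expected to be an InternetExplorer instance (late bound here).
Public Function WaitForInfoChange(ByVal ie As Object, ByVal oldText As String, _
                                  Optional ByVal timeoutSecs As Double = 10) As Boolean
    Dim startTime As Double
    Dim info As Object

    startTime = Timer   ' seconds since midnight; not adjusted for midnight rollover
    Do
        DoEvents
        ' Re-query from ie.document on every pass to avoid holding a stale element
        Set info = ie.document.querySelector("#activitylog_table_info")
        If Not info Is Nothing Then
            If info.innerText <> oldText Then
                WaitForInfoChange = True
                Exit Function
            End If
        End If
    Loop While Timer - startTime < timeoutSecs
End Function
```

In a page loop, the current info text would be read before clicking the next button and then passed in as `oldText`.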

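Depending on how you want to write out the table info, you may be able to store it in an array; otherwise, you could write the table out inside the loop by finding the next available row in the sheet. I show how to write to the next available row in a previous SO answer here. This answer shows how to use the clipboard to paste the table to the next row during a loop.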
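Writing an array to the next available row could look like the sketch below. It assumes each page's cells have already been collected into a two-dimensional Variant array and that column A is populated on every row written; `AppendRows` is a hypothetical name and this is not the code from the linked answers.

```vba
Option Explicit

' Hypothetical helper: writes a 2-D array to ws starting at the first
' empty row, judged by the last used cell in column A.
Public Sub AppendRows(ByVal ws As Worksheet, ByRef data As Variant)
    Dim lastRow As Long, nextRow As Long
    Dim rowCount As Long, colCount As Long

    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
    If lastRow = 1 And IsEmpty(ws.Cells(1, 1).Value) Then
        nextRow = 1             ' sheet is still empty
    Else
        nextRow = lastRow + 1
    End If

    rowCount = UBound(data, 1) - LBound(data, 1) + 1
    colCount = UBound(data, 2) - LBound(data, 2) + 1

    ' One assignment for the whole block is much faster than cell-by-cell writes
    ws.Cells(nextRow, 1).Resize(rowCount, colCount).Value = data
End Sub
```

Calling `AppendRows Sheet6, pageData` once per page stacks each page below the previous one.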

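A better approach might be to authenticate and fetch all the information via XHR requests, but I cannot currently determine whether that is feasible.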

Option Explicit
'Requires a reference to Microsoft Internet Controls (VBE > Tools > References)
Public Sub test()
    Dim ie As New InternetExplorer, numPages As Long, length As Long, i As Long

    With ie
        .Visible = True
        .navigate "loginURL"

        While .Busy Or .readyState < 4: DoEvents: Wend
        'login stuff here ....

        While .Busy Or .readyState < 4: DoEvents: Wend

        With .document
            length = .querySelectorAll(".page-link").length
            numPages = CLng(.querySelectorAll(".page-link").item(length - 3).innerText)
            'Assume on page 1 and extract last page number from index length - 3
            '(skipping the final two links, data-dt-idx="9" (next) and data-dt-idx="10" (last))
            'do something with page 1 then click through next button for num of pages
            For i = 2 To numPages
                .querySelector("#activitylog_table_next").Click
                '.querySelector("[data-dt-idx='" & i + 1 & "']").Click  'alternate; use one or the other, not both
                While ie.Busy Or ie.readyState < 4: DoEvents: Wend
                'do something with other pages
            Next
            Stop '<=delete me later
        End With
        .Quit
    End With
End Sub

To set it to 500 entries per page:

ie.document.querySelector("[value='500']").Selected = True
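Setting `Selected` from code does not by itself raise a `change` event, so the table may not redraw. The sketch below dispatches the event explicitly. It assumes the page runs in an IE9+ document mode (where `createEvent` and `dispatchEvent` are available) and that the site's script listens for `change` on the length dropdown; `SetPageLength` is a hypothetical name, and whether the table actually reloads depends on the site.

```vba
Option Explicit

' Hypothetical helper: selects a page length in the "Show N entries"
' dropdown and fires a change event so the page's script can react.
' doc is expected to be ie.document (late bound here).
Public Sub SetPageLength(ByVal doc As Object, ByVal pageLength As Long)
    Dim sel As Object, evt As Object

    ' name attribute taken from the HTML in the question
    Set sel = doc.querySelector("select[name='activitylog_table_length']")
    If sel Is Nothing Then Exit Sub

    sel.Value = CStr(pageLength)        ' must match one of the option values

    Set evt = doc.createEvent("HTMLEvents")
    evt.initEvent "change", True, False ' bubbles, not cancelable
    sel.dispatchEvent evt
End Sub
```

With 668 entries and `SetPageLength ie.document, 500`, only two pages would remain to loop over.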