VBA: scraping across multiple web pages

Time: 2019-03-01 15:52:14

Tags: excel vba web-scraping screen-scraping


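I have the following code, which scrapes data from a website and works fine.
My "problem" is that the site paginates its results, so I need the code to run over several pages.
For example, a single page shows 48 records, but in most cases the listing has more than 200 records, split across 3 or 4 pages.
My code: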

Public Sub Roupa()
    Dim data As Object, i As Long, html As HTMLDocument, r As Long, c As Long, item As Object, div As Object
    Set html = New HTMLDocument                  '<== VBE > Tools > References > Microsoft HTML Object Library
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.worten.pt/grandes-eletrodomesticos/maquinas-de-roupa/maquinas-de-roupa-ver-todos-marca-BALAY-e-BOSCH-e-SIEMENS?per_page=100", False
        .send
        html.body.innerHTML = .responseText
    End With
    Set data = html.getElementsByClassName("w-product__content")
    For Each item In data
        r = r + 1: c = 1
        For Each div In item.getElementsByTagName("div")
            With ThisWorkbook.Worksheets("Roupa")
                .Cells(r, c) = div.innerText
            End With
            c = c + 1
        Next
    Next
    Sheets("Roupa").Range("A:A,C:C,F:F,G:G,H:H,I:I").EntireColumn.Delete
End Sub

Update
I tried adding For n = 1 To 2 before the With block. It works, but I have to know the exact number of pages in advance, so it is not much help.

1 answer:

Answer 0 (score: 1)

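Work out how many pages there are by dividing the result count by the number of results per page and rounding up. Then loop, concatenating the appropriate page number onto the URL.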

Option Explicit
Public Sub Roupa()
    Dim data As Object, i As Long, html As HTMLDocument, r As Long, c As Long, item As Object, div As Object
    Set html = New HTMLDocument                  '<== VBE > Tools > References > Microsoft HTML Object Library
    Const RESULTS_PER_PAGE As Long = 48
    Const START_URL As String = "https://www.worten.pt/grandes-eletrodomesticos/maquinas-de-roupa/maquinas-de-roupa-ver-todos-marca-BALAY-e-BOSCH-e-SIEMENS?per_page=" & RESULTS_PER_PAGE & "&page=1"

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", START_URL, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        html.body.innerHTML = .responseText
        Dim numPages As Long, numResults As Long, arr() As String
        arr = Split(html.querySelector(".w-filters__element").innerText, Chr$(32))
        numResults = arr(UBound(arr))
        numPages = 1
        If numResults > RESULTS_PER_PAGE Then
            numPages = Application.RoundUp(numResults / RESULTS_PER_PAGE, 0)
        End If

        For i = 1 To numPages
            If i > 1 Then                        ' page 1 was already loaded above
                ' Reuse START_URL; matching "&page=1" avoids touching the "per_page=" parameter
                .Open "GET", Replace$(START_URL, "&page=1", "&page=" & i), False
                .setRequestHeader "User-Agent", "Mozilla/5.0"
                .send
                html.body.innerHTML = .responseText
            End If
            Set data = html.getElementsByClassName("w-product__content")
            For Each item In data
                r = r + 1: c = 1
                For Each div In item.getElementsByTagName("div")
                    With ThisWorkbook.Worksheets("Roupa")
                        .Cells(r, c) = div.innerText
                    End With
                    c = c + 1
                Next
            Next
        Next
    End With
    Sheets("Roupa").Range("A:A,C:C,F:F,G:G,H:H,I:I").EntireColumn.Delete
End Sub
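
The page-count step in the code above can be pulled out into two small helpers, which makes it easier to check on its own. This is only a sketch: the names LastNumber and PageCount are hypothetical (they are not in the original answer), and it assumes, as the answer's code does, that the text of the .w-filters__element node ends with the total result count preceded by a space.

```vba
Option Explicit

' Hypothetical helper: returns the last space-separated token of a string as a Long.
' Assumes the total result count is the final token; returns 0 if it is not numeric.
Public Function LastNumber(ByVal filterText As String) As Long
    Dim parts() As String
    parts = Split(Trim$(filterText), " ")
    If UBound(parts) >= 0 Then
        LastNumber = CLng(Val(parts(UBound(parts))))
    End If
End Function

' Hypothetical helper: integer ceiling division, so a partly filled last page still counts.
Public Function PageCount(ByVal numResults As Long, ByVal resultsPerPage As Long) As Long
    If resultsPerPage <= 0 Then Err.Raise 5, , "resultsPerPage must be positive"
    If numResults <= 0 Then
        PageCount = 1                            ' always request at least the first page
    Else
        PageCount = (numResults + resultsPerPage - 1) \ resultsPerPage
    End If
End Function
```

For example, PageCount(200, 48) gives 5 and PageCount(48, 48) gives 1. Using the integer-division operator \ keeps the calculation in plain VBA, so it does not depend on the Excel worksheet function RoundUp.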

Taking into account what @AhmedAu said: if the page has loaded correctly, it looks like a good way to get the page count is simply to use:

numPages = html.querySelectorAll("[data-page]").Length
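
The selector-based count above could be combined with the calculated count as a fallback. The following is a minimal sketch under some unverified assumptions: that the pagination markup contains exactly one element with a data-page attribute per page, and that an absent pager means the selector matches nothing. The function name GetPageCount and its fallbackCount parameter are hypothetical.

```vba
Option Explicit

' Requires: VBE > Tools > References > Microsoft HTML Object Library
' Hypothetical helper: prefer the number of [data-page] elements,
' otherwise fall back to a count the caller has computed another way.
Public Function GetPageCount(ByVal html As HTMLDocument, ByVal fallbackCount As Long) As Long
    Dim n As Long
    n = html.querySelectorAll("[data-page]").Length
    If n > 0 Then
        GetPageCount = n
    Else
        GetPageCount = fallbackCount             ' no pager found in the markup
    End If
End Function
```

Inside the loop setup it might be called as numPages = GetPageCount(html, numPages), after numPages has been calculated from the result count. If the site also puts data-page on previous/next links, the selector would over-count, so it is worth checking the result against the calculated value once.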