VBA web scraping script returns "Subscript out of range"

Time: 2019-03-06 11:04:40

Tags: excel vba web-scraping screen-scraping

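Question:

I have been trying to scrape data from a website, but it keeps giving me a "Subscript out of range" error and I don't know why. I used exactly the same code on another website and it worked fine.

Yes, I have already changed the divs to match the new website I am scraping from.

Can anyone help me? Thanks in advance.

Code: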

Option Explicit
Public Sub Loiça()
    Dim data As Object, i As Long, html As HTMLDocument, r As Long, c As Long, item As Object, div As Object
    Set html = New HTMLDocument                  '<== VBE > Tools > References > Microsoft HTML Object Library

    Const START_URL As String = "https://mediamarkt.pt/pages/search-results-page?q=maquina+roupa&page=1"

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", START_URL, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        html.body.innerHTML = .responseText
        Dim numPages As Long, numResults As Long, arr() As String
        arr = Split(html.querySelector(".snize-search-results-header").innerText, Chr$(32))
        numResults = arr(UBound(arr))
        numPages = 1


        For i = 1 To numPages
             If i > 1 Then
                .Open "GET", Replace$("https://mediamarkt.pt/pages/search-results-page?q=maquina+roupa&page=1", "page=1", "page=" & i), False
                .setRequestHeader "User-Agent", "Mozilla/5.0"
                .send
                 html.body.innerHTML = .responseText
            End If
            Set data = html.getElementsByClassName("snize-four-columns")
            For Each item In data
                r = r + 1: c = 1
                For Each div In item.getElementsByTagName("div")
                    With ThisWorkbook.Worksheets("Loiça")
                        .Cells(r, c) = div.innerText
                    End With
                    c = c + 1
                Next
            Next
        Next
    End With
    '----------------------------------------------------------------------------------------------------------------------------------------------------------------------'
End Sub

1 Answer:

Answer 0 (score: 3)

This

html.querySelector(".snize-search-results-header").innerText

is returning an empty string, so Split gives you an empty array whose UBound is -1, and arr(UBound(arr)) then raises "Subscript out of range".
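The failure described above can be reproduced without any web request. The following is a minimal sketch, not the original poster's code: LastToken and DemoLastToken are hypothetical helper names, and the sample header text is invented for illustration.

```vba
Option Explicit

' Returns the last space-separated token of txt, or "" when txt has no tokens.
Public Function LastToken(ByVal txt As String) As String
    Dim parts() As String
    parts = Split(Trim$(txt), Chr$(32))
    ' Split on a zero-length string yields an array with LBound 0 and UBound -1,
    ' so indexing it with UBound would raise "Subscript out of range".
    If UBound(parts) >= LBound(parts) Then LastToken = parts(UBound(parts))
End Function

Public Sub DemoLastToken()
    Debug.Print UBound(Split("", Chr$(32)))      ' prints -1
    Debug.Print LastToken("Showing 24 of 120")   ' hypothetical header text, prints 120
    Debug.Print Len(LastToken(""))               ' prints 0 instead of raising an error
End Sub
```

Guarding on UBound before indexing turns the runtime error into an empty result that the caller can test for.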

The value is probably filled in by JavaScript after the page loads, and MSXML2.XMLHTTP does not run scripts. Inspect the returned HTML to confirm; I think no value is returned for that element. Use something like Selenium or Internet Explorer automation, which lets JavaScript run on the page and populate the content with the value.
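One way to inspect the returned HTML is to classify the header element in the raw response. This is a sketch under assumptions: HeaderState and InspectStaticHeader are hypothetical names, and the code relies on the same Microsoft HTML Object Library reference that the question already uses.

```vba
Option Explicit

' Classifies the header element in a chunk of HTML:
' "missing" (no such element), "empty" (element without text), or its text.
' Requires VBE > Tools > References > Microsoft HTML Object Library.
Public Function HeaderState(ByVal pageHtml As String) As String
    Dim doc As HTMLDocument, header As Object
    Set doc = New HTMLDocument
    doc.body.innerHTML = pageHtml
    Set header = doc.querySelector(".snize-search-results-header")
    If header Is Nothing Then
        HeaderState = "missing"
    ElseIf Len(Trim$(header.innerText)) = 0 Then
        HeaderState = "empty"
    Else
        HeaderState = Trim$(header.innerText)
    End If
End Function

' Fetches the page the same way the question does and prints the result.
Public Sub InspectStaticHeader()
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://mediamarkt.pt/pages/search-results-page?q=maquina+roupa&page=1", False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        Debug.Print "HTTP status:", .Status
        Debug.Print "Header state:", HeaderState(.responseText)
    End With
End Sub
```

If the Immediate window shows "empty" or "missing", the text is not in the static response and a browser-based approach is needed.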

In this case you also need the LBound element (the first token of the header text) rather than the UBound one, so you could use a function that loads the page in IE and returns the value:

numPages = GetNumberOfPages

Public Function GetNumberOfPages() As Long
    Dim IE As New InternetExplorer               '<== VBE > Tools > References > Microsoft Internet Controls
    With IE
        .Visible = False
        .Navigate2 "https://mediamarkt.pt/pages/search-results-page?q=maquina+roupa&page=1"

        While .Busy Or .readyState < 4: DoEvents: Wend

        Dim numResults As Long, arr() As String
        arr = Split(.document.querySelector(".snize-search-results-header").innerText, Chr$(32))
        numResults = arr(LBound(arr))            ' first token of the header text
        GetNumberOfPages = numResults
        .Quit
    End With
End Function

For the next part (the loop over the products), I think you need a different class name:

Set data = html.getElementsByClassName("snize-product")

Inspect the html to verify.
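A quick way to verify is to count how many elements each class name matches. The sketch below assumes pageHtml holds the rendered markup (for example the document's body.innerHTML taken from the IE instance); CountByClass and CompareClassNames are hypothetical helpers, and the Microsoft HTML Object Library reference is required.

```vba
Option Explicit

' Counts the elements that carry className in a chunk of HTML.
Public Function CountByClass(ByVal pageHtml As String, ByVal className As String) As Long
    Dim doc As HTMLDocument
    Set doc = New HTMLDocument
    doc.body.innerHTML = pageHtml
    CountByClass = doc.getElementsByClassName(className).Length
End Function

' Prints the match count for the class from the question and the one suggested here.
Public Sub CompareClassNames(ByVal pageHtml As String)
    Debug.Print "snize-four-columns:", CountByClass(pageHtml, "snize-four-columns")
    Debug.Print "snize-product:", CountByClass(pageHtml, "snize-product")
End Sub
```

Whichever class name gives a count equal to the number of products shown on the page is the one to loop over.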