一半的记录被刮掉了84

时间:2017-03-10 20:29:47

标签: vba web-scraping

我在VBA中创建了一个解析器,它可以从黄页加拿大中删除该名称。但是,问题是页面包含84个名称,但我的解析器只抓取41个名称。我怎样才能解决这个问题?任何帮助都是我的祝福。提前致谢。这是代码:

http.Open "GET", "http://www.yellowpages.ca/search/si/1/Outdoor%20wedding/Edmonton", False
http.send
html.body.innerHTML = http.responseText

Set topics = html.getElementsByClassName("listing__name--link jsListingName")
For Each topic In topics
    Cells(x, 1) = topic.innerText
    x = x + 1
Next topic

顺便说一句,我使用了MSxml2.xmlhttp60请求。

1 个答案:

答案 0 :(得分:2)

如果您查看网页的网页请求,您会发现一旦页面滚动到某一点后它会触发另一个网页请求。

新请求的格式如下:

前40条记录: http://www.yellowpages.ca/search/si/1/Outdoor%20wedding/Edmonton

接下来的40条记录: http://www.yellowpages.ca/search/si/2/Outdoor%20wedding/Edmonton

接下来的40条记录: http://www.yellowpages.ca/search/si/3/Outdoor%20wedding/Edmonton

基本上对于新数据(以40个记录的批量),它将URL的一部分增加1。

哪个好消息,我们可以做一个循环来返回结果。这是我提出的代码。无论出于何种原因,getElementsByClassName选择器对我不起作用,所以我在我的代码中解决了这个问题。如果您可以使用该选择器,请使用该选择器代替我在下面的部分。

最后,我添加了对Microsoft XML v6.0的明确引用,因此您应该这样做以使其按原样运行。

Option Explicit

Public Sub SOTestScraper()
    Dim topics      As Object
    Dim topic       As Object
    Dim webResp     As Object
    Dim i           As Long
    Dim j           As Long
    Dim mySheet     As Worksheet: Set mySheet = ThisWorkbook.Sheets("Sheet1") ' Change this
    Dim myArr()     As Variant: ReDim myArr(10000) 'Probably overkill

    For i = 1 To 20 ' unsure how many records you expect, I defaulted to 20 pages, or 800 results
        Set webResp = getWebResponse(CStr(i)) ' return the web response
        Set topics = webResp.getElementsByTagName("*") ' I couldn't find the className so I did this instead
        If topics Is Nothing Then Exit For 'Exit the for loop if Status 200 wasn't received
        For Each topic In topics
            On Error Resume Next
            'If getElementByClassName is working for you, use it
            If topic.ClassName = "listing__name--link jsListingName" Then
                myArr(j) = topic.InnerText
                j = j + 1
            End If
        Next
    Next

    'add the data to Excel
    ReDim Preserve myArr(j - 1)
    mySheet.Range("A1:A" & j) = WorksheetFunction.Transpose(myArr)
End Sub

Function getWebResponse(ByVal pageNumber As String) As Object
    Dim http As MSXML2.ServerXMLHTTP60: Set http = New MSXML2.ServerXMLHTTP60
    Dim html As Object: Set html = CreateObject("htmlfile")

    With http
        .Open "GET", "http://www.yellowpages.ca/search/si/" & pageNumber & "/Outdoor%20wedding/Edmonton"
        .send
        .waitForResponse
        html.body.innerHTML = .responseText
        .waitForResponse
    End With

    If Not http.Status = 200 Then
        Set getWebResponse = Nothing
    Else
        Set getWebResponse = html
    End If

    Set html = Nothing
    Set http = Nothing
End Function