我在VBA中创建了一个解析器,它可以从黄页加拿大中删除该名称。但是,问题是页面包含84个名称,但我的解析器只抓取41个名称。我怎样才能解决这个问题?任何帮助都是我的祝福。提前致谢。这是代码:
http.Open "GET", "http://www.yellowpages.ca/search/si/1/Outdoor%20wedding/Edmonton", False
http.send
html.body.innerHTML = http.responseText
Set topics = html.getElementsByClassName("listing__name--link jsListingName")
For Each topic In topics
Cells(x, 1) = topic.innerText
x = x + 1
Next topic
顺便说一句,我使用了MSxml2.xmlhttp60请求。
答案 0 :(得分:2)
如果您查看网页的网页请求,您会发现一旦页面滚动到某一点后它会触发另一个网页请求。
新请求的格式如下:
前40条记录: http://www.yellowpages.ca/search/si/1/Outdoor%20wedding/Edmonton
接下来的40条记录: http://www.yellowpages.ca/search/si/2/Outdoor%20wedding/Edmonton
接下来的40条记录: http://www.yellowpages.ca/search/si/3/Outdoor%20wedding/Edmonton
基本上对于新数据(以40个记录的批量),它将URL的一部分增加1。
哪个好消息,我们可以做一个循环来返回结果。这是我提出的代码。无论出于何种原因,getElementsByClassName
选择器对我不起作用,所以我在我的代码中解决了这个问题。如果您可以使用该选择器,请使用该选择器代替我在下面的部分。
最后,我添加了对Microsoft XML v6.0
的明确引用,因此您应该这样做以使其按原样运行。
Option Explicit
Public Sub SOTestScraper()
Dim topics As Object
Dim topic As Object
Dim webResp As Object
Dim i As Long
Dim j As Long
Dim mySheet As Worksheet: Set mySheet = ThisWorkbook.Sheets("Sheet1") ' Change this
Dim myArr() As Variant: ReDim myArr(10000) 'Probably overkill
For i = 1 To 20 ' unsure how many records you expect, I defaulted to 20 pages, or 800 results
Set webResp = getWebResponse(CStr(i)) ' return the web response
Set topics = webResp.getElementsByTagName("*") ' I couldn't find the className so I did this instead
If topics Is Nothing Then Exit For 'Exit the for loop if Status 200 wasn't received
For Each topic In topics
On Error Resume Next
'If getElementByClassName is working for you, use it
If topic.ClassName = "listing__name--link jsListingName" Then
myArr(j) = topic.InnerText
j = j + 1
End If
Next
Next
'add the data to Excel
ReDim Preserve myArr(j - 1)
mySheet.Range("A1:A" & j) = WorksheetFunction.Transpose(myArr)
End Sub
Function getWebResponse(ByVal pageNumber As String) As Object
Dim http As MSXML2.ServerXMLHTTP60: Set http = New MSXML2.ServerXMLHTTP60
Dim html As Object: Set html = CreateObject("htmlfile")
With http
.Open "GET", "http://www.yellowpages.ca/search/si/" & pageNumber & "/Outdoor%20wedding/Edmonton"
.send
.waitForResponse
html.body.innerHTML = .responseText
.waitForResponse
End With
If Not http.Status = 200 Then
Set getWebResponse = Nothing
Else
Set getWebResponse = html
End If
Set html = Nothing
Set http = Nothing
End Function