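Web scraping with Excel and VBA

Asked: 2016-05-16 21:19:03

Tags: excel vba web screen-scraping

I wrote the VBA code below in an Excel workbook, but it does not return the data I want and I don't know why. Can anyone help? The only result it gives me is "Click here to read more", whereas I want to scrape the entire data set: first name, last name, state, zip code, and so on.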

Sub extractTablesData()
    Dim IE As Object, obj As Object
    Dim myState As String
    Dim r As Integer, c As Integer, t As Integer
    Dim elemCollection As Object

    Set IE = CreateObject("InternetExplorer.Application")

    myState = InputBox("Enter the city where you wish to work")

    With IE

        .Visible = True
        .navigate ("http://www.funeralhomes.com/go/listing/Search?  name=&city=&state=&country=USA&zip=&radius=")

        While IE.readyState <> 4
            DoEvents
        Wend

        For Each obj In IE.document.all.item("state").Options
            If obj.innerText = myState Then
                obj.Selected = True
            End If
        Next obj

        IE.document.getElementsByValue("Search").item.Click

        Do While IE.Busy: DoEvents: Loop

        ThisWorkbook.Sheets("Sheet1").Range("A1:K1500").ClearContents

        Set elemCollection = IE.document.getElementsByTagName("TABLE")

        For t = 0 To (elemCollection.Length - 1)

            For r = 0 To (elemCollection(t).Rows.Length - 1)
                For c = 0 To (elemCollection(t).Rows(r).Cells.Length - 1)
                    ThisWorkbook.Worksheets(1).Cells(r + 1, c + 1) = elemCollection(t).Rows(r).Cells(c).innerText
                Next c
            Next r
        Next t

    End With
    Set IE = Nothing
End Sub

2 answers:

Answer 0 (score: 0)

Yes, without an API this can be very tricky and, worst of all, very inconsistent. For now, you can try the script below.

Sub DumpData()
    Dim IE As Object, itm As Object
    Dim URL As String
    Dim RowCount As Long

    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True

    URL = "http://www.funeralhomes.com/go/listing/ShowListing/USA/New%20York/New%20York"

    IE.Navigate2 URL
    ' Wait for the site to fully load (readyState 4 = complete)
    Do While IE.Busy Or IE.readyState <> 4
        DoEvents
    Loop

    With Sheets("Sheet1")
        .Cells.ClearContents
        RowCount = 1
        ' Keep only elements whose class marks a free or paid listing
        For Each itm In IE.document.all
            If itm.className Like "*free-listing*" Or itm.className Like "*paid-listing*" Then
                .Range("A" & RowCount) = itm.className
                .Range("B" & RowCount) = Left(itm.innerText, 1024)
                RowCount = RowCount + 1
            End If
        Next itm
    End With
End Sub

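You will probably want some kind of input box to capture the user's city, state, and radius, or read those values from cells in the worksheet.

Note that '%20' is the URL-encoded form of a space character.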
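As a rough sketch of that idea, the URL could be assembled from user input along these lines. The helper names (BuildListingUrl, EncodeSpaces, DemoBuildListingUrl) are made up for illustration, and the path layout (.../ShowListing/USA/state/city) is only inferred from the single example URL above, so treat it as an assumption rather than a documented scheme.

```vba
' Hypothetical helper: builds a listing URL from a state and a city.
' ASSUMPTION: the site uses .../ShowListing/USA/<state>/<city>, as inferred
' from the one example URL in this answer.
Public Function BuildListingUrl(ByVal stateName As String, ByVal cityName As String) As String
    Const BASE_URL As String = "http://www.funeralhomes.com/go/listing/ShowListing/USA/"
    BuildListingUrl = BASE_URL & EncodeSpaces(stateName) & "/" & EncodeSpaces(cityName)
End Function

' Trims the input and replaces each space with %20.
' Other reserved URL characters are left untouched in this sketch.
Private Function EncodeSpaces(ByVal rawText As String) As String
    EncodeSpaces = Replace(Trim$(rawText), " ", "%20")
End Function

' Asks the user for a state and a city, then prints the resulting URL
' to the Immediate window.
Public Sub DemoBuildListingUrl()
    Dim stateName As String, cityName As String

    stateName = InputBox("Enter the state")
    cityName = InputBox("Enter the city")

    ' InputBox returns an empty string when the user cancels
    If Len(stateName) = 0 Or Len(cityName) = 0 Then Exit Sub

    Debug.Print BuildListingUrl(stateName, cityName)
End Sub
```

Calling BuildListingUrl("New York", "New York") reproduces the URL hard-coded in DumpData, so the result could be passed to IE.Navigate2 in place of the constant.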

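I got this idea from a friend of mine, Joel, a long time ago. That guy is great!

Answer 1 (score: 0)

Using the same URL as the answer already given, you can use CSS selectors to grab the elements of interest and Split to pull the name and address parts out of the text. We can also drop the browser entirely, which gets results from the first results page faster.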

Business name:

You can get the names with the following selector (using paid listings as the example):

div.paid-listing .listing-title

This selects every element with class listing-title inside a div with class paid-listing.

Address information:

The associated descriptive information can be retrieved with the selector:

div.paid-listing .address-summary

Then, using Split, we can parse it into the address information.
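To illustrate the Split step on its own, here is a minimal sketch that breaks a multi-line summary into its non-empty lines. SplitAddressLines is a hypothetical helper, and the assumption that the summary text separates its parts with line breaks comes only from the Chr$(10) delimiter used in this answer's code; the real layout of the page may differ.

```vba
' Hypothetical helper: splits a multi-line address summary into trimmed,
' non-empty lines and returns them as a 1-based Collection.
' ASSUMPTION: the parts of the summary are separated by line breaks.
Public Function SplitAddressLines(ByVal summary As String) As Collection
    Dim result As New Collection
    Dim parts() As String
    Dim lineText As String
    Dim i As Long

    ' Normalise CRLF and CR to LF so a single delimiter is enough
    summary = Replace(Replace(summary, vbCrLf, vbLf), vbCr, vbLf)

    ' Split on an empty string yields an empty array, so the loop is skipped
    parts = Split(summary, vbLf)
    For i = LBound(parts) To UBound(parts)
        lineText = Trim$(parts(i))
        If Len(lineText) > 0 Then result.Add lineText
    Next i

    Set SplitAddressLines = result
End Function
```

With a made-up input such as "123 Main St" & vbLf & "New York, NY 10001", the returned collection holds two items, and item 1 corresponds to the first line that the answer's code writes to column B.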

Code:

Option Explicit

' Requires a reference to "Microsoft HTML Object Library" (Tools > References)
Public Sub GetTitleAndAddress()
    Dim oHtml As HTMLDocument, nodeList1 As Object, nodeList2 As Object, i As Long
    Const URL As String = "http://www.funeralhomes.com/go/listing/ShowListing/USA/New%20York/New%20York"
    Set oHtml = New HTMLDocument

    ' Fetch the page over HTTP instead of driving a browser
    With CreateObject("WINHTTP.WinHTTPRequest.5.1")
        .Open "GET", URL, False
        .send
        oHtml.body.innerHTML = .responseText
    End With

    ' Titles and address summaries of the paid listings
    Set nodeList1 = oHtml.querySelectorAll("div.paid-listing .listing-title")
    Set nodeList2 = oHtml.querySelectorAll("div.paid-listing .address-summary")

    With Worksheets("Sheet3")
        .UsedRange.ClearContents
        For i = 0 To nodeList1.Length - 1
            .Range("A" & i + 1) = nodeList1.Item(i).innerText
            ' Keep only the first line of the address summary
            .Range("B" & i + 1) = Split(nodeList2.Item(i).innerText, Chr$(10))(0)
        Next i
    End With
End Sub
