web scraping vba"对象变量或块变量未设置错误"和浏览器崩溃

时间:2016-01-18 21:06:23

标签: excel vba web-scraping xmlhttprequest

所以我尝试将多个网站的数据从excel中删除。我认为代码在理论上运行良好,虽然我有"对象变量或块变量没有设置错误"。

我面临的第一个问题是,它有时会从2到10完美地循环并插入每一位数据,但有时我会收到错误并看到只插入了1或2行数据。我真的无法弄清楚到底是什么原因。

其次,这是某种演示代码。我只使用一小批数据并通过它循环。我的真正目的是找到一种方法,使网络抓取最多100行,而不会崩溃我的计算机或浏览器。如果我将我的代码转换为XMLHTTP类型的抓取是否会更好,如果是这样我该怎么办呢。

先谢谢

Private Sub CommandButton1_Click()

    Dim ie As Object
    Dim iexp As Object
    Dim firstname(1 To 10), lastname(1 To 10) As Variant
    Dim mm(1 To 10), dd(1 To 10), yyyy(1 To 10) As Integer
    Dim PhoneNumber(1 To 10) As Variant
    Dim Address(1 To 10) As Variant
    Dim HomeValue(1 To 10) As Variant


    Dim i As Integer



        For i = 2 To 10

        'get variables from excel sheet1 and search on peoplefinders.com

                firstname(i) = Sheet1.Cells(i, 1).Value
                lastname(i) = Sheet1.Cells(i, 2).Value
                mm(i) = Sheet1.Cells(i, 3).Value
                dd(i) = Sheet1.Cells(i, 4).Value
                yyyy(i) = Sheet1.Cells(i, 5).Value

                Set ie = CreateObject("InternetExplorer.Application")
                ie.Visible = True
                ie.Height = 1000
                ie.Width = 1000
                ie.navigate ("http://www.peoplefinders.com/peoplesearch/searchresults?search=People&fn=" & firstname(i) & "&mn=&ln=" & lastname(i) & "&city=&state=&age=&dobmm=" & mm(i) & "&dobdd=" & dd(i) & "&doby=" & yyyy(i))

                Do While ie.Busy: DoEvents: Loop
                Dim Doc As HTMLDocument

                Set Doc = ie.document

          'get elements and insert into cells in sheet 1

               PhoneNumber(i) = Doc.getElementsByTagName("td")(2).getElementsByTagName("a")(0).innerText
               Address(i) = Doc.getElementsByTagName("td")(1).getElementsByTagName("a")(0).innerText

               Sheet1.Cells(i, 6).Value = PhoneNumber(i)
               Sheet1.Cells(i, 7).Value = Address(i)

          'modify address for next search

               a = Split(Address(i), " ")
               b = Join(a, "-")

          'search home value on zillow.com

               Set iexp = CreateObject("InternetExplorer.Application")
                iexp.Visible = True
                iexp.Height = 1000
                iexp.Width = 1000
                iexp.navigate ("http://www.zillow.com/homes/" & b & "_rb/")

                Do While iexp.Busy: DoEvents: Loop
                Dim Doc2 As HTMLDocument

                Set Doc2 = iexp.document

               iexp.navigate ("http://www.zillow.com/homes/" & b & "_rb/")

          'insert home value into cells in sheet 1

               HomeValue(i) = Doc2.getElementsByClassName("home-summary-row")(1).getElementsByTagName("span")(1).innerText

               Sheet1.Cells(i, 8).Value = HomeValue(i)



        Next


    End Sub

2 个答案:

答案 0 :(得分:0)

正如Kerem Turgutlu指出的那样,检查Busy通常是否足够,你必须检查readystate。我是这样做的:

Sub WaitBrowser(browser As Object)
     Do While browser.Busy
           DoEvents
     Loop
     Do While browser.readyState <> 4
           DoEvents
     Loop
End Sub

然后我在每次导航后调用WaitBrowser IE(其中IE是我的InternetExplorer.Application对象),然后插入文档元素。

至于其他方法,只要有可能,为了提高效率和可预测性,我更喜欢直接使用API​​发送HTTP消息(我通常使用WinHTTP,但我相信也可以使用XMLHTTP或winInet)。两个值得注意的例外:1)我想在处理期间或之后将用户引导到浏览器,或者2)在查找要发送的内容以获取所需数据时涉及复杂的脚本(在这种情况下,让浏览器更容易执行其操作作业)。

这是一个改编自最近项目的例子(没有错误检查):

Function FindLink() As String
    Dim Request as Object
    Set Request = CreateObject("WinHttp.WinHttpRequest.5.1")
    Request.Open "GET", "http://example.com/pagewithinfo"
    Request.Send
    Dim resp as String
    resp = Request.ResponseText
    'create html tree with response
    Dim h As Object
    Set h = CreateObject("htmlfile")
    h.body.innerHTML = respA
    'get the info
    FindLink = h.DocumentElement.GetElementsByTagName("a")(0).GetAttribute("href")
    Set h = Nothing
    Set Request = Nothing 
End Function

答案 1 :(得分:0)

XHR请求在第一个结果页面上列出所有结果。您可以在循环中使用它来访问不同的结果页面。

需要引用HTML Object LibraryMicrosoft XML libraryXMLHTTP60特定于Excel2016。您可能需要适应您的版本,例如,从末尾删除60。

Option Explicit
Public Sub GetInfo()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim ws As Worksheet: Set ws = ActiveSheet
    Const URL As String = "https://www.peoplefinders.com/peoplesearch/searchresults?search=People&fn=Je&mn=&ln=Bloggs&city=&state=AL&age=&dobmm=&dobdd=&doby="
    Application.ScreenUpdating = True
    With http
        .Open "GET", URL, False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim hTables As Object, hTable As Object
    Set hTables = html.getElementsByTagName("table")

    For Each hTable In hTables
        WriteTable hTable, GetLastRow(ws, 1) + 1, ws
    Next hTable
    Application.ScreenUpdating = True
End Sub

Public Function GetLastRow(ByVal ws As Worksheet, ByVal columNum As Long) As Long
    With ws
        GetLastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
    End With
End Function

Public Sub WriteTable(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)

    If ws Is Nothing Then Set ws = ActiveSheet

    Dim tSection As Object, tRow As Object, tCell As Object, tr As Object, td As Object, R As Long, C As Long, tBody As Object
    R = startRow
    With ws
        Dim headers As Object, header As Object, columnCounter As Long
        Set headers = hTable.getElementsByTagName("th")
        For Each header In headers
            columnCounter = columnCounter + 1
            .Cells(startRow, columnCounter) = header.innerText
        Next header
        startRow = startRow + 1
        Set tBody = hTable.getElementsByTagName("tbody")
        For Each tSection In tBody               'HTMLTableSection
            Set tRow = tSection.getElementsByTagName("tr") 'HTMLTableRow
            For Each tr In tRow
                R = R + 1
                Set tCell = tr.getElementsByTagName("td")
                C = 1
                For Each td In tCell             'DispHTMLElementCollection
                    .Cells(R, C).Value = td.innerText 'HTMLTableCell
                    C = C + 1
                Next td
            Next tr
        Next tSection
    End With
End Sub

您可以通过在URL字符串的末尾添加页码来循环浏览各个页面,例如

https://www.peoplefinders.com/peoplesearch/searchresults?fn=Je&ln=Bloggs&state=AL&search=People&StartPage=2

您可以获得要循环的页面数

Dim a As Object, numPages As Long
Set a = html.querySelectorAll("a.pageslinks.PaginationLinks")
numPages = a(a.Length - 1).innerText

这样可以确定何时停止循环进行特定搜索的结果页面。