探索Web抓取中的Instr VBA功能

时间:2019-07-09 03:30:30

标签: html excel vba web-scraping

我想使用VBA InStr函数抓取该URL https://www.realtor.com/realestateandhomes-search/06510,并提取所有带有此子字符串“ 06510”的URL

这是我一直在努力的示例代码。

Option Explicit

Sub GetLinks()


    '
    'To use HTMLDocument you need to set a reference to Tools -> References -> Microsoft HTML Object Library
    Dim HTML As New HTMLDocument
    Dim http As Object
    Dim links As Object
    Dim link As HTMLHtmlElement
    Dim counter As Long
    Dim website As Range
    Dim LastRange As Range
    Dim row As Long
    Dim continue As Boolean
    Dim respHead As String
    Dim lRow As Long

    Application.ScreenUpdating = False

    ' The row where website addresses start
    row = 30
    continue = True

    lRow = Cells(Rows.count, 1).End(xlUp).row + 1

    ' XMLHTTP gives errors where ServerXMLHTTP does not
    ' even when using the same URL's
    'Set http = CreateObject("MSXML2.XMLHTTP")

    Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")

    Do While continue

        ' Could set this to first cell with URL then OFFSET columns to get next web site
        Set website = Range("A" & row)

        Set LastRange = Range("B" & lRow)

        If Len(website.Value) < 1 Then

            continue = False
            Exit Sub

        End If

        If website Is Nothing Then

            continue = False

        End If

        'Debug.Print website

        With http

            On Error Resume Next
            .Open "GET", website.Value, False
            .send

            ' If Err.Num is not 0 then an error occurred accessing the website
            ' This checks for badly formatted URL's. The website can still return an error
            ' which should be checked in .Status

            'Debug.Print Err.Number

            ' Clear the row of any previous results
            Range("B" & row & ":e" & row).Clear

            ' If the website sent a valid response to our request
            If Err.Number = 0 Then


                If .Status = 200 Then

                    HTML.body.innerHTML = http.responseText

                    Set links = HTML.getElementsByTagName("a")

                    For Each link In links

                        If InStr(link.outerHTML, "06510") Then

                            LastRange.Value = link.href

                        End If




                    Next

                End If

                Set website = Nothing

            Else

                    'Debug.Print "Error loading page"
                    LastRange.Value = "Error with website address"

            End If

                    On Error GoTo 0

        End With

        row = row + 1

    Loop

    Application.ScreenUpdating = True

End Sub

检查页面后,以下是要提取的URL类型的示例-https://www.realtor.com/realestateandhomes-detail/239-Bradley-St_New-Haven_CT_06510_M36855-92189。任何帮助将不胜感激

以简化的方式使用QHarr的代码...

     Sub GetLinks()

    Dim url As String, links_count As Integer
    Dim j As Integer, row As Integer
    Dim XMLHTTP As Object, html As Object
    'Dim tr_coll As Object, tr As Object
    'Dim elements As Object
    Dim i As Long, allLinksOfInterest As Object
    'Dim td_coll As Object, td As Object, td_col, objT

    url = "https://www.realtor.com/realestateandhomes-search/06510"

    Set XMLHTTP = CreateObject("MSXML2.XMLHTTP")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.send

    Set html = CreateObject("htmlfile")
    html.body.innerHTML = XMLHTTP.responseText



    Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")

    For i = 0 To allLinksOfInterest.Length - 1
    Debug.Print allLinksOfInterest.Item(i).href
    Next

End Sub

请检查我是否有任何遗漏。我仍然收到错误“对象不支持此属性或方法”

2 个答案:

答案 0 :(得分:1)

If InStr(link.outerHTML, "06510") Then

在上面的代码中,InStr函数的使用类似于布尔函数。但是它不是布尔值,而是返回整数。因此,您应该在函数后添加比较运算符。可能像:

If InStr(link.outerHTML, "06510")>0 Then

答案 1 :(得分:1)

在所有Instr标签的循环期间,请勿在整个节点externalHTML上使用a。有时需要这样做,但不应该是其中之一(希望如此)。

您想将attribute = value css选择器与contains*运算符一起使用。专门用于匹配属性值中的子字符串的目的。这样更有效。

Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")

所以

Dim i As Long, allLinksOfInterest As Object

Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")
For i = 0 To allLinksOfInterest.Length - 1
    Debug.Print allLinksOfInterest.Item(i).href
Next

属性=包含包含运算符的值:

  

[attr * = value]

     

表示属性名称为attr的元素,其属性   值在字符串中至少包含一个值。


VBA:

当前产生26个链接。所有都是相对链接,因此需要添加域,如循环所示。有些是重复项,因此请考虑将其作为键添加到字典中,以便删除重复项。

Option Explicit

Public Sub GetLinks()
    Dim html As HTMLDocument

    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.realtor.com/realestateandhomes-search/06510", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim i As Long, allLinksOfInterest As Object

    Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")
    For i = 0 To allLinksOfInterest.Length - 1
        Debug.Print Replace$(allLinksOfInterest.item(i).href,"about:","https://www.realtor.com")
    Next
End Sub