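Web scraping by TagName

Time: 2019-04-06 14:02:41

Tags: excel vba internet-explorer web-scraping

I am trying to get some data from a website, but as I am new to web scraping I get confused between tag names, class names and IDs. I have only basic knowledge of this. I want to copy what is under "Data"; if a value is not present, the cell should be left blank and the code should move on to the next value.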


I tried to build the code, but I cannot work out exactly which tag name I need to select. Please help me copy this data.

1 answer:

Answer 0: (score: 2)

XHR:

All of the information can be retrieved with an XMLHTTP (XHR) request, which is much faster than opening a browser.

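I first retrieve the number of rows with the CSS selector .main li[class]. The "." is a class selector, li is a type selector, and [class] is an attribute selector; the space " " between them is a descendant combinator. Together they say that I want every li tag/type element that has a class attribute and that has an ancestor with the class name main.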

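This matches the parent li element of each listing, so the length of the result set gives me the number of rows to retrieve information from.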
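To make the selector concrete, here is a minimal sketch that runs it against a small, made-up HTML fragment rather than the real page. The markup, the "row" class name and the sub name CountListingRows are assumptions for illustration only; it needs the Microsoft HTML Object Library reference.

```vba
Option Explicit

' Illustrative only: the markup below is hypothetical, not the real page.
' Requires a reference to Microsoft HTML Object Library.
Public Sub CountListingRows()
    Dim html As HTMLDocument, listings As Object

    Set html = New HTMLDocument
    html.body.innerHTML = _
        "<div class=""main""><ul>" & _
        "<li class=""row""><span class=""size"">5x5</span></li>" & _
        "<li class=""row""><span class=""size"">10x10</span></li>" & _
        "<li>no class attribute</li>" & _
        "</ul></div>" & _
        "<ul><li class=""row"">outside the .main container</li></ul>"

    ' li elements that have a class attribute AND an ancestor with class "main"
    Set listings = html.querySelectorAll(".main li[class]")

    ' Number of matched rows
    Debug.Print listings.Length
End Sub
```

By the selector's rules, only the two li elements that carry a class attribute and sit inside the .main container are meant to match; the other two are there to show what gets excluded.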

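This collection of li elements is returned by querySelectorAll as a nodeList. I cannot loop over this list and apply getElementsByClassName / querySelector to the individual nodes, because the li elements do not expose methods I can use.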

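Now, as I am not using a browser, I have to rely on the methods available to the HTMLDocument object. Unlike when automating a browser, here I do not have access to the limited set of pseudo-class selectors that a browser supports, which would let me use selector syntax such as :nth-of-type to reach individual rows. This is an annoying limitation of web scraping with VBA.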

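So, what can we do? Well, in this case I can dump the innerHTML of each node into a second HTMLDocument variable, html2, which gives me access to that object's querySelector/querySelectorAll methods. The HTML is then limited to the current li.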

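If we look at the HTML in question, we can see that the li elements are general siblings: they sit next to each other at the same level. While looping the nodeList listings, I transfer the innerHTML of the current node into html2, my second HTMLDocument variable.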

It is worth noting that I could probably use children to work through each listing, for example:

listings.item(i).Children(2)......

I could then split on new lines to access all the information. I think the approach I have given is faster and more robust, though.

VBA:

Option Explicit  
Public Sub GetInfo()
    Dim ws As Worksheet, html As HTMLDocument, s As String
    Const URL As String = "https://www.neighborhoodselfstorage.net/self-storage-delmar-md-f1426"

    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText
        html.body.innerHTML = s

        Dim headers(), results(), listings As Object, amenities As String

        headers = Array("Size", "Description", "Amenities", "Offer1", "Offer2", "RateType", "Price")
        Set listings = html.querySelectorAll(".main li[class]")

        Dim rowCount As Long, numColumns As Long, r As Long, c As Long
        Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long

        rowCount = listings.Length
        numColumns = UBound(headers) + 1

        ReDim results(1 To rowCount, 1 To numColumns)
        Dim html2 As HTMLDocument
        Set html2 = New HTMLDocument
        For item = 0 To listings.Length - 1
            r = r + 1
            html2.body.innerHTML = listings.item(item).innerHTML
            'size,description, amenities,specials offer1 offer2, rate type, price

            results(r, 1) = Trim$(html2.querySelector(".size").innerText)
            results(r, 2) = Trim$(html2.querySelector(".description").innerText)
            Set icons = html2.querySelectorAll("i[title]")

            ReDim amenitiesInfo(0 To icons.Length - 1)

            For icon = 0 To icons.Length - 1
                amenitiesInfo(icon) = icons.item(icon).getAttribute("title")
            Next

            amenities = Join$(amenitiesInfo, ", ")

            results(r, 3) = amenities
            results(r, 4) = html2.querySelector(".offer1").innerText
            results(r, 5) = html2.querySelector(".offer2").innerText
            results(r, 6) = html2.querySelector(".rate-label").innerText
            results(r, 7) = html2.querySelector(".price").innerText
        Next

        ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub
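
The question asks for a blank cell when a value is missing. As far as I know, querySelector returns Nothing when there is no match, so calling .innerText on the result directly would raise a run-time error. A minimal sketch of one way to guard against that follows; GetTextOrBlank is a hypothetical helper name, not part of the original answer.

```vba
' Hypothetical helper, not part of the original answer.
' Requires a reference to Microsoft HTML Object Library.
' Returns the trimmed innerText of the first match, or an empty
' string when the selector matches nothing.
Public Function GetTextOrBlank(ByVal doc As HTMLDocument, ByVal selector As String) As String
    Dim node As Object

    Set node = doc.querySelector(selector)
    If Not node Is Nothing Then GetTextOrBlank = Trim$(node.innerText)
End Function
```

Inside the loop it could be used as, for example, results(r, 4) = GetTextOrBlank(html2, ".offer1"). A similar check on icons.Length would be needed before the ReDim if a row can have no amenity icons, since ReDim amenitiesInfo(0 To -1) fails.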

Internet Explorer:

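This assumes there is no redirect from the given URL. Here I use the :nth-of-type pseudo-class selector to target each row of the listings. These rows are the li (list) elements that contain the information for each box listing. I build a CSS selector string that specifies the row and then the element within that row that I am after. I pass that string to querySelector/querySelectorAll, which returns the matching element(s).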

Option Explicit

Public Sub UseIE()
    Dim ie As New InternetExplorer, ws As Worksheet
    Const Url As String = "https://www.neighborhoodselfstorage.net/self-storage-delmar-md-f1426"

    Set ws = ThisWorkbook.Worksheets("Sheet1")

    With ie
        .Visible = True
        .Navigate2 Url

        While .Busy Or .readyState < 4: DoEvents: Wend

        Dim headers(), results(), listings As Object, listing As Object, amenities As String

        headers = Array("Size", "Description", "Amenities", "Offer1", "Offer2", "RateType", "Price")

        Set listings = .document.querySelectorAll(".main li[class]")

        Dim rowCount As Long, numColumns As Long, r As Long, c As Long
        Dim icons As Object, icon As Long, amenitiesInfo(), i As Long

        rowCount = listings.Length
        numColumns = UBound(headers) + 1
        ReDim results(1 To rowCount, 1 To numColumns)
        For Each listing In listings
            r = r + 1
            'size,description, amenities,specials offer1 offer2, rate type, price
            With .document

                results(r, 1) = Trim$(.querySelector(".main li:nth-of-type(" & r & ") .size").innerText)
                results(r, 2) = Trim$(.querySelector(".main li:nth-of-type(" & r & ") .description").innerText)

                Set icons = .querySelectorAll("." & Join$(Split(listing.className, Chr$(32)), ".") & ":nth-of-type(" & r & ") i[title]")

                ReDim amenitiesInfo(0 To icons.Length - 1)

                For icon = 0 To icons.Length - 1
                    amenitiesInfo(icon) = icons.item(icon).getAttribute("title")
                Next

                amenities = Join$(amenitiesInfo, ",")
                results(r, 3) = amenities
                results(r, 4) = .querySelector(".main li:nth-of-type(" & r & ") .offer1").innerText
                results(r, 5) = .querySelector(".main li:nth-of-type(" & r & ") .offer2").innerText
                results(r, 6) = .querySelector(".main li:nth-of-type(" & r & ") .rate-label").innerText
                results(r, 7) = .querySelector(".main li:nth-of-type(" & r & ") .price").innerText
            End With
        Next
        .Quit
        ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub
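
The row selector strings built inline above all follow one pattern, which can be sketched as a small function. RowSelector and DemoRowSelector are hypothetical names used only for illustration; the pattern also assumes, as the code above does, that every li sibling in the list is a listing row, because :nth-of-type counts all li siblings and not just those with a class.

```vba
' Hypothetical helper, not part of the original answer.
' Builds the selector for one element inside the r-th listing row.
Public Function RowSelector(ByVal rowNumber As Long, ByVal childSelector As String) As String
    RowSelector = ".main li:nth-of-type(" & rowNumber & ") " & childSelector
End Function

Public Sub DemoRowSelector()
    ' Prints: .main li:nth-of-type(2) .price
    Debug.Print RowSelector(2, ".price")
End Sub
```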


References (VBE > Tools > References):

  1. Microsoft HTML Object Library
  2. Microsoft Internet Controls