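Stuck web scraping with VBA

Time: 2018-08-13 09:51:06

Tags: vba excel-vba web-scraping

I am trying to use VBA to automate a web scraper that collects price data for certain goods. I am still very new to VBA and have been trying to build my code from answers to similar questions here, but I am stuck on a "Type mismatch" error. I use this to open IE, and it works fine: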

    Dim appIE As Object
    Set appIE = CreateObject("internetexplorer.application")

    With appIE
        .Navigate "https://grocery.walmart.com/"
        .Visible = True
    End With

    Do While appIE.Busy
        DoEvents
    Loop

However, I now want to find the prices, e.g. $1.67 for the Colgate and $2.78 for the Nature Valley, in the following markup:

<span data-automation-id="items">
<div class="CartItem__itemContainer___3vA-E" tabindex="-1" data-automation-id="cartItem">
<div class="CartItem__itemInfo___3rgQd">
<span class="TileImage__tileImage___35CNo">
<div class="TileImage__imageContainer___tlQZb">
<img alt="1 of C, o" src="https://i5.walmartimages.com/asr/36829cef-43f2-4d21-9d5e-10aa9def01dd_7.04089903cc0038b3dac3c204ef7e417e.png?odnHeight=150&amp;odnWidth=150&amp;odnBg=ffffff" class="TileImage__image___3MrIo" data-automation-id="image" aria-hidden="true">
</div><span data-automation-id="quantity" class="TileImage__quantity___1rgG4 hidden__audiblyHidden___RoAkK" role="button" aria-label="1 of C, select to change quantities">
1</span></span><div class="CartItem__name___2RJs5">
<div data-automation-id="name" tabindex="0" role="button" aria-label="C button, Select to change quantities">
Colgate Cavity Protection Fluoride Toothpaste - 6 oz</div><span data-automation-id="list-price" class="ListPrice__listPrice___1x8TM" aria-label="1 dollar and 67 cents  each">
$1.67 each</span><a class="CartItem__detailsLink___2ts9b" aria-label="Colgate Cavity Protection Fluoride Toothpaste - 6 oz" tabindex="0" href="/ip/Colgate-Cavity-Protection-Fluoride-Toothpaste---6-oz/49714957">
View details</a></div><span class="Price__groceryPriceContainer___19Jim CartItem__price___2ADX6" data-automation-id="price" aria-label="1 dollar and 67 cents ">
<sup class="Price__currencySymbol___3Ye7d">
$</sup><span class="Price__wholeUnits___lFhG5" data-automation-id="wholeUnits">
1</span><sup class="Price__partialUnits___1VX5w" data-automation-id="partialUnits">
67</sup></span></div><div></div></div><div class="CartItem__itemContainer___3vA-E" tabindex="-1" data-automation-id="cartItem">
<div class="CartItem__itemInfo___3rgQd">
<span class="TileImage__tileImage___35CNo">
<div class="TileImage__imageContainer___tlQZb">
<img alt="1 of N, a" src="https://i5.walmartimages.com/asr/775482d5-a136-4ca3-9353-28646ec999c3_1.d861ce7abd9797cbafec2cd2a4b24874.jpeg?odnHeight=150&amp;odnWidth=150&amp;odnBg=ffffff" class="TileImage__image___3MrIo" data-automation-id="image" aria-hidden="true">
</div><span data-automation-id="quantity" class="TileImage__quantity___1rgG4 hidden__audiblyHidden___RoAkK" role="button" aria-label="1 of N, select to change quantities">
1</span></span><div class="CartItem__name___2RJs5">
<div data-automation-id="name" tabindex="0" role="button" aria-label="N button, Select to change quantities">
Nature Valley Granola Bars Sweet and Salty Nut Cashew 6 Bars - 1.2 oz</div><span data-automation-id="list-price" class="ListPrice__listPrice___1x8TM" aria-label="2 dollars and 78 cents  each">
$2.78 each</span><a class="CartItem__detailsLink___2ts9b" aria-label="Nature Valley Granola Bars Sweet and Salty Nut Cashew 6 Bars - 1.2 oz" tabindex="0" href="/ip/Nature-Valley-Granola-Bars-Sweet-and-Salty-Nut-Cashew-6-Bars---1.2-oz/10311347">
View details</a></div><span class="Price__groceryPriceContainer___19Jim CartItem__price___2ADX6" data-automation-id="price" aria-label="2 dollars and 78 cents ">
<sup class="Price__currencySymbol___3Ye7d">
$</sup><span class="Price__wholeUnits___lFhG5" data-automation-id="wholeUnits">
2</span><sup class="Price__partialUnits___1VX5w" data-automation-id="partialUnits">
78</sup></span></div><div></div></div>

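My instinct (as a true beginner) is to find the div class section above, then search for aria-label and copy the text that follows it. I suspect that would be a lot of trouble, though, and could produce a lot of errors if that div class name is repeated elsewhere on the page.

Any help on how I should proceed (and on whether this is a good idea) would be much appreciated. Thanks!

1 answer:

Answer 0: (score: 0)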

All of the prices can be selected with a CSS selector that targets the class attribute:

    [class='Price__groceryPriceContainer___19Jim CartItem__price___2ADX6']

You apply the CSS selector through the querySelectorAll method of document, which returns a nodeList.

You can also get a collection with:

    .document.getElementsByClassName("Price__groceryPriceContainer___19Jim CartItem__price___2ADX6")

Code outline:

    Option Explicit

    Public Sub TEST()
        Dim appIE As Object
        Set appIE = CreateObject("internetexplorer.application")

        With appIE
            .navigate "https://grocery.walmart.com/" 'Go to the homepage
            .Visible = True 'Show the browser window

            Do While .Busy = True Or .readyState <> 4: DoEvents: Loop 'Wait for the page to finish loading

            Dim priceList As Object, nameList As Object, i As Long, ws As Worksheet, lastRow As Long
            Set ws = ActiveSheet
            'Code to get your basket ready
            lastRow = GetLastRow(ws, 1)

            Set priceList = .document.querySelectorAll("[class='Price__groceryPriceContainer___19Jim CartItem__price___2ADX6']") 'Match the basket item prices by their class attribute
            Set nameList = .document.querySelectorAll("[data-automation-id='name']") 'Match the basket item names

            For i = 0 To priceList.Length - 1 'Loop over the nodeList of matched elements
                With ws
                    .Cells(lastRow + i + 1, 1) = nameList.item(i).innerText 'Name of each matched item
                    .Cells(lastRow + i + 1, 2) = Now 'Timestamp of this reading
                    .Cells(lastRow + i + 1, 3) = priceList.item(i).innerText 'Price of each matched item
                End With
            Next i
        End With
    End Sub

    'Returns the last used row in the given column of the worksheet
    Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
        With ws
            GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
        End With
    End Function

Fixed basket items:

Toothpaste:

If the items in the cart stay fixed and their prices are updated in the basket over time, you can track changes in the toothpaste price. For example, with the CSS selector:

    .CartItem__name___2RJs5 + span

So:

    Debug.Print .document.querySelector(".CartItem__name___2RJs5 + span").innerText

Or:

    Debug.Print .document.querySelectorAll("[class='Price__groceryPriceContainer___19Jim CartItem__price___2ADX6']").item(0).innerText

The last one uses the class attribute to return a nodeList of all matching elements (your basket) and accesses the first item (the toothpaste) by index 0.

Or you can use the .querySelector method, which returns the first match, i.e. index 0:

    Debug.Print .document.querySelector("[class='Price__groceryPriceContainer___19Jim CartItem__price___2ADX6']").innerText

My code locates elements by using a CSS selector (the mechanism used to style the page) to match on their class attribute. All of your basket item prices have the class attribute Price__groceryPriceContainer___19Jim CartItem__price___2ADX6, so the code pulls back a nodeList (a little like an array) of the elements with that class attribute. It then loops over the length of the nodeList to access each element by index (starting at 0). The .innerText property returns the element's text as a literal string, i.e. the price.
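The text that comes back from these elements is a string, not a number. Judging by the markup quoted in the question, the list-price span reads `$1.67 each`, while the price container holds `$`, `1` and `67` in separate child elements with no decimal point between them, so its innerText may not be directly usable as a price. Below is a minimal sketch of two conversion helpers. `ParseListPrice` and `CombinePrice` are hypothetical names that are not part of the original answer, and both assume US-style prices with "." as the decimal separator and two-digit cents.

```vba
Option Explicit

' Hypothetical helpers (not from the original answer) for turning scraped
' price text into numbers. Assumes "." as the decimal separator and
' two-digit cents.

' Extracts the first number found in text such as "$1.67 each".
' Returns 0 when the text contains no digits.
Public Function ParseListPrice(ByVal priceText As String) As Double
    Dim i As Long, ch As String, numberText As String
    For i = 1 To Len(priceText)
        ch = Mid$(priceText, i, 1)
        If ch Like "[0-9.]" Then
            numberText = numberText & ch    ' still inside the number
        ElseIf Len(numberText) > 0 Then
            Exit For                        ' the number has ended
        End If
    Next i
    ParseListPrice = Val(numberText)        ' Val treats "." as the decimal point
End Function

' Builds a price from the separate wholeUnits ("1") and
' partialUnits ("67") texts; surrounding whitespace is ignored.
Public Function CombinePrice(ByVal wholeText As String, ByVal partialText As String) As Double
    CombinePrice = ParseListPrice(wholeText) + ParseListPrice(partialText) / 100
End Function
```

For instance, passing the innerText returned by the `.CartItem__name___2RJs5 + span` selector to `ParseListPrice` would give 1.67 for the toothpaste, provided the selector still matches the live page. Storing a number instead of the raw string makes it easier to compare prices across readings in the worksheet.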