html报废vba中的双类

时间:2018-03-01 22:05:47

标签: excel-vba web-scraping vba excel

我正在尝试使用下面的VBA代码从this HTML page中提取价格:

这是HTML代码段:

<div class="box-text box-text-products">
    <div class="title-wrapper">
        <p class="category uppercase is-smaller no-text-overflow product-cat op-7">
    Xikar Lighters      
        </p>
        <p class="name product-title">
            <a href="https://www.havanahouse.co.uk/product/xikar-allume-single-jet-flame-racing-cigar-lighter-bluewhite-stripe/">Xikar Allume Single Jet Flame Racing Cigar Lighter &#8211; Blue/White Stripe</a>
        </p>
    </div>
    <div class="price-wrapper">
        <span class="price">
            <del>
                <span class="woocommerce-Price-amount amount">
                    <span class="woocommerce-Price-currencySymbol">&pound;</span>48.00
                </span>
            </del>
            <ins>
                <span class="woocommerce-Price-amount amount">
                    <span class="woocommerce-Price-currencySymbol">&pound;</span>45.00
                </span>
            </ins>
        </span>
    </div>
</div>
<!-- box-text -->undefined</div>undefined<!-- box -->undefined</div>undefined<!-- .col-inner -->undefined</div>undefined<!-- col -->

我使用下面的代码,但收到错误:

For Each oElement In oHtml.getElementsByClassName("woocommerce-Price-amoun t amount")
    If oElement.getElementsByTagName("del") Then Exit For

    If oElement.innerText <> 0  Then
        Cells(counter, 3) = CDbl(oElement.innerText)
        counter = counter + 1
    End If
Next oElement

1 个答案:

答案 0 :(得分:0)

看一下下面的例子:

Option Explicit

Sub Test()

    Dim sUrl As String
    Dim oWS As Worksheet
    Dim i As Long
    Dim sResp As String
    Dim sCont As String
    Dim oMatch

    sUrl = "https://www.havanahouse.co.uk/?post_type=product"
    Set oWS = ThisWorkbook.Sheets(1)
    oWS.Cells.Delete
    i = 1
    Do
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", sUrl, False
            .send
            sResp = .ResponseText
        End With
        With CreateObject("VBScript.RegExp")
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
            .Pattern = "<div class=""shop-container"">([\s\S]*?)<div class=""container"">"
            With .Execute(sResp)
                If .Count = 0 Then Exit Do
                sCont = .Item(0).SubMatches(0)
            End With
            .Pattern = "<div class=""title-wrapper"">([\s\S]*?)</div><div class=""price-wrapper"">([\s\S]*?)</div>"
            For Each oMatch In .Execute(sCont)
                oWS.Cells(i, 1) = GetInnerText(oMatch.SubMatches(0))
                oWS.Cells(i, 2) = GetInnerText(oMatch.SubMatches(1))
                oWS.Columns.AutoFit
                i = i + 1
                DoEvents
            Next
            oWS.Cells(i, 1).Select
            .Pattern = "<a class=""next page-number""[\s\S]*?href=""([^""]*)"""
            With .Execute(sResp)
                If .Count = 0 Then Exit Do
                sUrl = .Item(0).SubMatches(0)
            End With
        End With
    Loop

End Sub

Function GetInnerText(sText As String) As String

    Static oHtmlfile As Object
    Static oDiv As Object

    If oHtmlfile Is Nothing Then
        Set oHtmlfile = CreateObject("htmlfile")
        oHtmlfile.Open
        Set oDiv = oHtmlfile.createElement("div")
    End If
    oDiv.innerHTML = sText
    GetInnerText = oDiv.innerText

End Function

我的输出如下:

output

一般情况下,建议不要使用RegEx进行HTML解析,因此there is disclaimer。在这种情况下处理的数据非常简单,这就是使用RegEx解析的原因。关于RegEx:introduction(特别是syntax),introduction JSVB flavor

BTW使用类似方法还有另一个答案:12345