VBA, getElementsByClassName: the double quotes in the HTML source have disappeared

Time: 2015-12-16 01:24:35

Tags: excel vba excel-vba getelementsbyclassname

I scrape a few websites for fun, with VBA as my tool. I use XMLHTTP and HTMLDocument (because they are faster than InternetExplorer.Application).

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    Dim XMLHTTPReq As Object
    Dim htmlDoc As HTMLDocument

    Dim postURL As String

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

        Set XMLHTTPReq = New MSXML2.XMLHTTP

        With XMLHTTPReq
            .Open "GET", postURL, False
            .Send
        End With

        Set htmlDoc = New HTMLDocument
        With htmlDoc
            .body.innerHTML = XMLHTTPReq.responseText
        End With

        i = 0

        Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")

        For Each vr In varTemp
            ''''the next line is important to solve this issue *1
            Cells(1, 1) = vr.outerHTML
            Set varTemp2 = vr.getElementsByTagName("SPAN class=post_date")
            Cells(i + 1, 3) = varTemp2.Item(0).innerText
            ''''the next line raises run-time error 438''''
            Set varTemp2 = vr.getElementsByClassName("hover_inner")
            Cells(i + 1, 4) = varTemp2.innerText

            i = i + 1

        Next vr
End Sub

I tracked the problem down with the line marked *1. Cells(1, 1) showed me the following:

<DIV class="post_glass post_micro_glass" title=""><A class=hover title="" href="http://foodffs.tumblr.com/post/134291668251/sugar-free-low-carb-coffee-ricotta-mousse-really" target=_blank>
<DIV class=hover_inner><SPAN class=post_date>...............

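Yes, all the class attributes have lost their double quotes (" "). Only the class of the first element (the outer DIV) still has them. I really don't know why this happens.

// I can parse the page with getElementsByTagName("span"), but I would prefer to use the "class" attribute.....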

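2 answers:

Answer 0: (score: 5)

The getElementsByClassName method is not considered a method of the element itself, only of the parent HTMLDocument. If you want to use it to locate elements inside a DIV element, you need to create a sub-HTMLDocument made up of that specific DIV element's .outerHTML.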

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    Dim xmlHTTPReq As New MSXML2.XMLHTTP
    Dim htmlDOC As New HTMLDocument, divSUBDOC As New HTMLDocument
    Dim iDIV As Long, iSPN As Long, iEL As Long
    Dim postURL As String, nr As Long, i As Long

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

    With xmlHTTPReq
        .Open "GET", postURL, False
        .Send
    End With

    'Set htmlDOC = New HTMLDocument
    With htmlDOC
        .body.innerHTML = xmlHTTPReq.responseText
    End With

    i = 0

    With htmlDOC
        For iDIV = 0 To .getElementsByClassName("post_glass post_micro_glass").Length - 1
            nr = Sheet1.Cells(Rows.Count, 3).End(xlUp).Offset(1, 0).Row
            With .getElementsByClassName("post_glass post_micro_glass")(iDIV)
                'method 1 - run through multiples in a collection
                For iSPN = 0 To .getElementsByTagName("span").Length - 1
                    With .getElementsByTagName("span")(iSPN)
                        Select Case LCase(.className)
                            Case "post_date"
                                Cells(nr, 3) = .innerText
                            Case "post_notes"
                                Cells(nr, 4) = .innerText
                            Case Else
                                'do nothing
                        End Select
                    End With
                Next iSPN
                'method 2 - create a sub-HTML doc to facilitate getting els by classname
                divSUBDOC.body.innerHTML = .outerHTML  'only the HTML from this DIV
                With divSUBDOC
                    If CBool(.getElementsByClassName("hover_inner").Length) Then 'there is at least 1
                        'use the first
                        Cells(nr, 5) = .getElementsByClassName("hover_inner")(0).innerText
                    End If
                End With
            End With
        Next iDIV
    End With

End Sub

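While the other .getElementsByXXXX methods can readily retrieve collections from within another element, the getElementsByClassName method needs to work on what it believes to be the HTMLDocument as a whole, even if you have fooled it into believing that.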
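The sub-document technique from method 2 can be wrapped in a small helper so it does not have to be repeated inline. The following is only a minimal sketch: the function name FirstTextByClass and its parameters are hypothetical (they do not appear in the code above), and it assumes the project has a reference to the Microsoft HTML Object Library.

```vba
Option Explicit

' Hypothetical helper, not part of the original answer.
' Assumes a reference to "Microsoft HTML Object Library" (MSHTML).
' Returns the innerText of the first element with the given class found
' inside parentEl, or an empty string when there is no match.
Public Function FirstTextByClass(ByVal parentEl As Object, _
                                 ByVal targetClass As String) As String
    Dim subDoc As MSHTML.HTMLDocument
    Dim matches As Object

    ' Copy only this element's HTML into a throwaway document so that
    ' getElementsByClassName can be called at document level.
    Set subDoc = New MSHTML.HTMLDocument
    subDoc.body.innerHTML = parentEl.outerHTML

    Set matches = subDoc.getElementsByClassName(targetClass)
    If matches.Length > 0 Then
        FirstTextByClass = matches.Item(0).innerText
    End If
End Function
```

Inside the iDIV loop, a call such as Cells(nr, 5) = FirstTextByClass(htmlDOC.getElementsByClassName("post_glass post_micro_glass")(iDIV), "hover_inner") would then stand in for the method 2 block. Building a new document per call is slower than reusing divSUBDOC, so this trades a little speed for readability.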

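Answer 1: (score: 1)

Here is another approach. It is very similar to the original code, but uses querySelectorAll to select the relevant span elements. One important point with this approach is that vr must be declared as a specific element type, not as IHTMLElement or a generic Object:

Option Explicit

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    ' Changed from generic Object to specific types
    ' (this is not strictly required)
    Dim XMLHTTPReq As MSXML2.XMLHTTP60
    Dim htmlDoc As HTMLDocument

    ' These declarations were not included in the original code
    Dim i As Integer
    Dim varTemp As Object
    ' IMPORTANT: vr must be declared as a specific element type and not
    ' as IHTMLElement or a generic Object
    Dim vr As HTMLDivElement
    Dim varTemp2 As Object

    Dim postURL As String

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

    ' Changed from XMLHTTP to XMLHTTP60 because XMLHTTP is equivalent
    ' to the older XMLHTTP30
    Set XMLHTTPReq = New MSXML2.XMLHTTP60

    With XMLHTTPReq
        .Open "GET", postURL, False
        .Send
    End With

    Set htmlDoc = New HTMLDocument
    With htmlDoc
        .body.innerHTML = XMLHTTPReq.responseText
    End With

    i = 0

    Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")

    For Each vr In varTemp
        ''''the next line is important to solve this issue *1
        Cells(1, 1) = vr.outerHTML
        Set varTemp2 = vr.querySelectorAll("span.post_date")
        Cells(i + 1, 3) = varTemp2.Item(0).innerText
        Set varTemp2 = vr.getElementsByClassName("hover_inner")
        ' Incorporates the correction from Jeeped's comment (#56349646)
        Cells(i + 1, 4) = varTemp2.Item(0).innerText

        i = i + 1
    Next vr
End Sub

Notes:

XMLHTTP is equivalent to the older XMLHTTP30.

The need to declare a specific element type was apparently explored in a related question but, unlike getElementsByClassName, querySelectorAll does not exist in any version of IHTMLElement.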