如何解析html而不在vba中创建Internet Explorer的对象?

时间:2017-12-01 11:12:29

标签: html excel vba excel-vba internet-explorer

我没有在工作的任何计算机上安装Internet Explorer,因此创建Internet Explorer的对象并使用ie.navigate解析html并搜索标签是不可能的。我的问题是,如何在不使用IE的情况下从帧源自动将标签中的某些数据拉到我的电子表格?答案中的代码示例非常有用:)谢谢

2 个答案:

答案 0 :(得分:5)

您可以使用 XMLHTTP 来检索网页的HTML源代码:

Function GetHTML(url As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .Send
        GetHTML = .ResponseText
    End With
End Function

我不建议将其用作工作表函数,否则每次工作表重新计算时都会重新查询网站URL。有些网站有逻辑可以通过频繁的重复呼叫来检测抓取,并且您的IP可能暂时或永久地禁止,具体取决于网站。

获得源HTML字符串(最好存储在变量中以避免不必要的重复调用),您可以使用基本文本函数来解析字符串以搜索您的标记。

此基本功能将返回 <tag> </tag> 之间的值:

Public Function getTag(url As String, tag As String, Optional occurNum As Integer) As String
    Dim html As String, pStart As Long, pEnd As Long, o As Integer
    html = GetHTML(url)

    'remove <> if they exist so we can add our own
    If Left(tag, 1) = "<" And Right(tag, 1) = ">" Then
        tag = Left(Right(tag, Len(tag) - 1), Len(Right(tag, Len(tag) - 1)) - 1)
    End If

    ' default to Occurrence #1
    If occurNum = 0 Then occurNum = 1
    pEnd = 1

    For o = 1 To occurNum
        ' find start <tag> beginning at 1 (or after previous Occurence)
        pStart = InStr(pEnd, html, "<" & tag & ">", vbTextCompare)
        If pStart = 0 Then
            getTag = "{Not Found}"
            Exit Function
        End If
        pStart = pStart + Len("<" & tag & ">")

        ' find first end </tag> after start <tag>
        pEnd = InStr(pStart, html, "</" & tag & ">", vbTextCompare)
    Next o

    'return string between start <tag> & end </tag>
    getTag = Mid(html, pStart, pEnd - pStart)
End Function

这只会找到基本的 <tag> ,但您可以添加/删除/更改文字功能以满足您的需求。

示例用法:

Sub findTagExample()

    Const testURL = "https://en.wikipedia.org/wiki/Web_scraping"

    'search for 2nd occurence of tag: <h2> which is "Contents" :
    Debug.Print getTag(testURL, "<h2>", 2)

    '...this returns the 8th occurence, "Navigation Menu" :
    Debug.Print getTag(testURL, "<h2>", 8)

    '...and this returns an HTML <span> containing a title for the 'Legal Issues' section:
    Debug.Print getTag("https://en.wikipedia.org/wiki/Web_scraping", "<h2>", 4)

End Sub

答案 1 :(得分:0)

任何进行了网络抓取的人都会熟悉创建Internet Explorer(IE)实例并导航到网址,然后在页面准备就绪后开始使用&#39; Microsoft HTML对象导航DOM图书馆&#39; (MSHTML)类型库。该问题询问IE是否不可用。对于运行Windows 10的盒子,情况也一样。

我曾怀疑可以独立于IE创建一个MSHTML.HTMLDocument实例,但它的创建并不明显。感谢提问者现在问这个问题。答案在于MSHTML.IHTMLDocument4.createDocumentFromUrl方法。一个人需要一个本地文件才能工作(编辑:实际上也可以放一个webby url!)但是我们有一个很好的整洁的Windows API函数叫URLDownloadToFile来下载文件。

此代码在我运行Microsoft Edge的Windows 10机器上运行,而不是Internet Explorer。这是一个重要的发现,并感谢提问者。

Option Explicit

'* Tools->Refernces Microsoft HTML Object Library


'* MSDN - URLDownloadToFile function - https://msdn.microsoft.com/en-us/library/ms775123(v=vs.85).aspx
Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" _
        (ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, _
        ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long

Sub Test()

    Dim fso As Object
    Set fso = CreateObject("Scripting.FileSystemObject")

    Dim sLocalFilename As String
    sLocalFilename = Environ$("TMP") & "\urlmon.html"

    Dim sURL As String
    sURL = "https://stackoverflow.com/users/3607273/s-meaden"


    Dim bOk As Boolean
    bOk = (URLDownloadToFile(0, sURL, sLocalFilename, 0, 0) = 0)
    If bOk Then
        If fso.FileExists(sLocalFilename) Then

            '* Tools->Refernces Microsoft HTML Object Library
            Dim oHtml4 As MSHTML.IHTMLDocument4
            Set oHtml4 = New MSHTML.HTMLDocument

            Dim oHtml As MSHTML.HTMLDocument
            Set oHtml = Nothing

            '* IHTMLDocument4.createDocumentFromUrl
            '* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
            Set oHtml = oHtml4.createDocumentFromUrl(sLocalFilename, "")

            '* need to wait a little whilst the document parses
            '* because it is multithreaded
            While oHtml.readyState <> "complete"
                DoEvents  '* do not comment this out it is required to break into the code if in infinite loop
            Wend
            Debug.Assert oHtml.readyState = "complete"


            Dim sTest As String
            sTest = Left$(oHtml.body.outerHTML, 100)
            Debug.Assert Len(Trim(sTest)) > 50  '* just testing we got a substantial block of text, feel free to delete

            '* page specific logic goes here
            Dim htmlAnswers As Object 'MSHTML.DispHTMLElementCollection
            Set htmlAnswers = oHtml.getElementsByClassName("answer-hyperlink")

            Dim lAnswerLoop As Long
            For lAnswerLoop = 0 To htmlAnswers.Length - 1
                Dim vAnswerLoop
                Set vAnswerLoop = htmlAnswers.Item(lAnswerLoop)
                Debug.Print vAnswerLoop.outerText

            Next

        End If
    End If
End Sub

感谢您提出这个问题。

P.S。我已经使用TaskList来验证在此代码运行时是否在引擎盖下创建了IExplore.exe。

P.P.S如果您喜欢这个,请在​​我的Excel Development Platform blog

上查看更多内容