Need help converting Internet Explorer based web scraping to XMLHTTP

Time: 2018-01-31 06:09:35

Tags: vba performance excel-vba optimization web-scraping

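I am trying to speed up some intranet web scraping and make it more reliable. I am just learning how to use XMLHTTP, and I need some advice on converting my code from IE-based scraping to XMLHTTP.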

My module has two subs: one loads and navigates the intranet site (GetWebTable) and the other parses the data (GetOneTable) to return the table in Excel. The subs are as follows:

Sub GetWebTable(sAccountNum As String)

On Error Resume Next

Dim objIE           As Object
Dim strBuffer       As String
Dim thisCol         As Integer
Dim iAcctCount      As Integer
Dim iCounter        As Integer
Dim iNextCounter    As Integer
Dim iAcctCell       As Integer
Dim thisColCustInfo As Integer
Dim iErrorCounter As Integer

    If InStr(1, sAccountNum, "-") <> 0 Then
        sAccountNum = Replace(sAccountNum, "-", "")
    End If

    If InStr(1, sAccountNum, " ") <> 0 Then
        sAccountNum = Replace(sAccountNum, " ", "")
    End If

    iErrorCounter = 1
    TRY_AGAIN:

    'Spawn Internet Explorer
        Set objIE = GetObject("new:{D5E8041D-920F-45e9-B8FB-XXXXXXX}") 
        DoEvents

        With objIE
            .Visible = False
            .Navigate "http://intranetsite.aspx"

            While .busy = True Or .readystate <> 4: DoEvents: Wend
            While .Document.readyState <> "complete": DoEvents: Wend

            .Document.getElementById("ctl00_MainContentRegion_tAcct").Value = sAccountNum

            While .busy = True Or .readyState <> 4: DoEvents: Wend
            While .Document.readyState <> "complete": DoEvents: Wend

            .Document.getElementById("ctl00_MainContentRegion_btnRunReport").Click

            While .busy = True Or .readyState <> 4: DoEvents: Wend
            While .Document.readyState <> "complete": DoEvents: Wend

        End With

            thisCol = 53
            thisColCustInfo = 53

        GetOneTable objIE.Document, 9, thisCol

        'Cleanup:
        objIE.Quit
        Set objIE = Nothing

    GetWebTable_Error:
    Select Case Err.Number
        Case 0
        Case Else
        Debug.Print Err.Number, Err.Description
        iErrorCounter = iErrorCounter + 1
        objIE.Quit
        Set objIE = Nothing
        Err.Clear 'reset so a successful retry is not mistaken for a failure
        If iErrorCounter <= 4 Then GoTo TRY_AGAIN 'give up after 4 attempts

            'Stop
    End Select
End Sub



Sub GetOneTable(varWebPageDoc, varTableNum, varColInsert)

Dim varDocElement   As Object ' the elements of the document
Dim varDocTable     As Object ' the table required
Dim varDocRow       As Object ' the rows of the table
Dim varDocCell      As Object ' the cells of the rows.
Dim Rng             As Range
Dim iCellCount      As Long
Dim iElemCount      As Long
Dim iTableCount     As Long
Dim iRowCount       As Long
Dim iRowCounter As Integer
Dim bTableEndFlag As Boolean

bTableEndFlag = False

For Each varDocElement In varWebPageDoc.all
    If varDocElement.nodeName = "TABLE" Then
        iElemCount = iElemCount + 1
    End If

    If iElemCount = varTableNum Then
        Set varDocTable = varDocElement

        iTableCount = iTableCount + 1
        iRowCount = iRowCount + 1
        Set Rng = Worksheets("Sheet1").Cells(2, varColInsert)

        For Each varDocRow In varDocTable.Rows

            For Each varDocCell In varDocRow.Cells
                If Left(varDocCell.innerText, 9) = "Total for" Then
                    bTableEndFlag = True
                    Exit For
                End If
                Rng.Value = varDocCell.innerText
                Set Rng = Rng.Offset(, 1)
                iCellCount = iCellCount + 1
            Next varDocCell

            iRowCount = iRowCount + 1
            Set Rng = Rng.Offset(1, -iCellCount)
            iCellCount = 0

        Next varDocRow

        Exit For

    End If

Next varDocElement

Set varDocElement = Nothing
Set varDocTable = Nothing
Set varDocRow = Nothing
Set varDocCell = Nothing
Set Rng = Nothing

End Sub

Any ideas?

1 answer:

Answer 0 (score: 1)

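HTML is not XML. XML is strict about matching opening and closing tags, whereas HTML is notorious for <br> tags that never get a closing </br>. If the HTML happens to be well-formed XML, you are very lucky.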

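Anyway, if what you want from XMLHTTP is the HTTP request and you still want to keep your IE-based scraping code, see this article: http://exceldevelopmentplatform.blogspot.com/2018/01/vba-xmlhttp-request-xhr-does-not-parse.html. It shows how to use XMLHTTP and then pass the response to MSHTML.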

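You can use MSHTML independently of IE; see the article "Use MSHTML to parse local HTML file without using Internet Explorer (Microsoft HTML Object Library)". If you read it, you will see that most of the code written against the IE object model is really written against the MSHTML object model, so you can decouple the two and ditch IE. Enjoy!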
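As a rough sketch of that decoupling, the following fetches the page with XMLHTTP and hands the response text to a late-bound MSHTML document, which is then passed to the question's own GetOneTable. The sub name GetWebTableXhr is made up for illustration, and the sketch assumes a plain GET returns the table; on the real site the report probably only appears after the form is submitted, so treat this as a starting point rather than a drop-in replacement.

```vba
Sub GetWebTableXhr()
    ' Hypothetical replacement for the IE-driven part of GetWebTable.
    Dim oXhr As Object
    Dim oDoc As Object

    ' Plain HTTP request, no browser window involved.
    Set oXhr = CreateObject("MSXML2.XMLHTTP.6.0")
    oXhr.Open "GET", "http://intranetsite.aspx", False  ' False = synchronous
    oXhr.send

    If oXhr.Status <> 200 Then
        Debug.Print "Request failed:", oXhr.Status, oXhr.statusText
        Exit Sub
    End If

    ' MSHTML document created without Internet Explorer.
    Set oDoc = CreateObject("htmlfile")
    oDoc.body.innerHTML = oXhr.responseText

    ' Reuse the existing parser from the question unchanged.
    GetOneTable oDoc, 9, 53
End Sub
```

Because the request is synchronous, the busy/readyState polling loops from the IE version are not needed. Scripts in the page are not executed, so this only works when the table is present in the HTML the server returns.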

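EDIT1: Don't forget you can ask your company's IT staff

You say it is an intranet site, which implies it is internal to your company, so you could ask the programmers responsible for that system for guidance on a direct API.

EDIT2: Folding in feedback on how to mimic a browser...

To mimic a browser, you need to work out what traffic a button click generates...

To watch network traffic, I suggest switching to Chrome as your browser. Then, on this web page, right-click and choose the "Inspect" menu option, which opens Chrome Developer Tools. In Developer Tools, select the "Network" tab, then click a link on this page and you will see the traffic it generates.

So if you want to use pure XMLHTTP and leave the browser behind, you will not be able to click buttons, but you can observe the network traffic that occurs when the button is clicked in a browser and then mimic that traffic in code.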

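For example, in your comment you asked how to enter an account number and click a button. I am guessing that clicking the button results in an XMLHTTP call to something like http://example.com/dowork/mypage.asp?accountnumber=1233456&otherParams=true, so you would see the account number tucked into the query parameters. Once you have that URL, you can put it into an XMLHTTP request.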
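If the traffic does turn out to be a GET like the guessed URL above, a minimal sketch could look like this. The function name FetchReportHtml is invented for illustration, and the URL and parameter names are only the guess from this answer, not the real endpoint; substitute whatever the Network tab actually shows.

```vba
Function FetchReportHtml(ByVal sAccountNum As String) As String
    ' Assumed endpoint and parameter names; copy the real ones from DevTools.
    Const BASE_URL As String = "http://example.com/dowork/mypage.asp"
    Dim oXhr As Object
    Dim sUrl As String

    sUrl = BASE_URL & "?accountnumber=" & sAccountNum & "&otherParams=true"

    Set oXhr = CreateObject("MSXML2.XMLHTTP.6.0")
    oXhr.Open "GET", sUrl, False   ' synchronous request
    oXhr.send

    ' Return the HTML on success; an empty string otherwise.
    If oXhr.Status = 200 Then FetchReportHtml = oXhr.responseText
End Function
```

The returned string can then be loaded into an MSHTML document and parsed the same way the IE document was.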

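One potential problem is that the system's designers may have chosen to hide the account number in the body of an HTTP POST, because it is sensitive/confidential data. Chrome Developer Tools is very good, though, and should still show this information; you may just have to hunt around for it.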
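For the POST case, the sketch below sends the account number as a form-encoded body instead of a query parameter. PostReportHtml, the URL and the accountnumber field name are all placeholders. An .aspx page will likely expect its real form field names and possibly hidden fields such as __VIEWSTATE as well, so copy the exact body that DevTools shows for the request.

```vba
Function PostReportHtml(ByVal sAccountNum As String) As String
    ' Assumed endpoint and field name; replace with the observed request.
    Dim oXhr As Object
    Dim sBody As String

    sBody = "accountnumber=" & sAccountNum

    Set oXhr = CreateObject("MSXML2.XMLHTTP.6.0")
    oXhr.Open "POST", "http://example.com/dowork/mypage.asp", False
    ' Tell the server the body is encoded like a submitted HTML form.
    oXhr.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
    oXhr.send sBody

    ' Return the HTML on success; an empty string otherwise.
    If oXhr.Status = 200 Then PostReportHtml = oXhr.responseText
End Function
```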