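I'm trying to speed up some intranet web scraping and make it more reliable. I'm just learning how to use XMLHTTP, and I need some advice on converting my code from IE-based scraping to XMLHTTP.
My module has two subs: one loads and navigates the intranet site (GetWebTable), and the other parses the data (GetOneTable) to return the table in Excel. The subs are as follows: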
Sub GetWebTable(sAccountNum As String)
    On Error Resume Next
    Dim objIE As Object
    Dim strBuffer As String
    Dim thisCol As Integer
    Dim iAcctCount As Integer
    Dim iCounter As Integer
    Dim iNextCounter As Integer
    Dim iAcctCell As Integer
    Dim thisColCustInfo As Integer
    Dim iErrorCounter As Integer
    If InStr(1, sAccountNum, "-") <> 0 Then
        sAccountNum = Replace(sAccountNum, "-", "")
    End If
    If InStr(1, sAccountNum, " ") <> 0 Then
        sAccountNum = Replace(sAccountNum, " ", "")
    End If
    iErrorCounter = 1
TRY_AGAIN:
    'Spawn Internet Explorer
    Set objIE = GetObject("new:{D5E8041D-920F-45e9-B8FB-XXXXXXX}")
    DoEvents
    With objIE
        .Visible = False
        .Navigate "http://intranetsite.aspx"
        While .busy = True Or .readystate <> 4: DoEvents: Wend
        While .Document.readyState <> "complete": DoEvents: Wend
        'Enter the account number (the original read the undeclared sAcct here)
        .Document.getElementById("ctl00_MainContentRegion_tAcct").Value = sAccountNum
        While .busy = True Or .readyState <> 4: DoEvents: Wend
        While .Document.readyState <> "complete": DoEvents: Wend
        .Document.getElementById("ctl00_MainContentRegion_btnRunReport").Click
        While .busy = True Or .readyState <> 4: DoEvents: Wend
        While .Document.readyState <> "complete": DoEvents: Wend
    End With
    thisCol = 53
    thisColCustInfo = 53
    GetOneTable objIE.Document, 9, thisCol
    'Cleanup:
    objIE.Quit
    Set objIE = Nothing
GetWebTable_Error:
    Select Case Err.Number
        Case 0
        Case Else
            Debug.Print Err.Number, Err.Description
            iErrorCounter = iErrorCounter + 1
            objIE.Quit
            Set objIE = Nothing
            If iErrorCounter > 4 Then On Error Resume Next
            GoTo TRY_AGAIN
            'Stop
    End Select
End Sub
Sub GetOneTable(varWebPageDoc, varTableNum, varColInsert)
    Dim varDocElement As Object ' the elements of the document
    Dim varDocTable As Object ' the table required
    Dim varDocRow As Object ' the rows of the table
    Dim varDocCell As Object ' the cells of the rows.
    Dim Rng As Range
    Dim iCellCount As Long
    Dim iElemCount As Long
    Dim iTableCount As Long
    Dim iRowCount As Long
    Dim iRowCounter As Integer
    Dim bTableEndFlag As Boolean
    bTableEndFlag = False
    For Each varDocElement In varWebPageDoc.all
        If varDocElement.nodeName = "TABLE" Then
            iElemCount = iElemCount + 1
        End If
        If iElemCount = varTableNum Then
            Set varDocTable = varDocElement
            iTableCount = iTableCount + 1
            iRowCount = iRowCount + 1
            Set Rng = Worksheets("Sheet1").Cells(2, varColInsert)
            For Each varDocRow In varDocTable.Rows
                For Each varDocCell In varDocRow.Cells
                    If Left(varDocCell.innerText, 9) = "Total for" Then
                        bTableEndFlag = True
                        Exit For
                    End If
                    Rng.Value = varDocCell.innerText
                    Set Rng = Rng.Offset(, 1)
                    iCellCount = iCellCount + 1
                Next varDocCell
                iRowCount = iRowCount + 1
                Set Rng = Rng.Offset(1, -iCellCount)
                iCellCount = 0
            Next varDocRow
            Exit For
        End If
    Next varDocElement
    Set varDocElement = Nothing
    Set varDocTable = Nothing
    Set varDocRow = Nothing
    Set varDocCell = Nothing
    Set Rng = Nothing
End Sub
Any ideas?
Answer 0 (score: 1)
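HTML is not XML. XML strictly requires every opening tag to have a matching closing tag, whereas HTML is notorious for <br> tags that are never closed with </br>. If your HTML happens to be well-formed XML, you are very lucky.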
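In any case, if you want to use XMLHTTP for the HTTP request and still keep your IE-based web scraping code, see this article: http://exceldevelopmentplatform.blogspot.com/2018/01/vba-xmlhttp-request-xhr-does-not-parse.html It shows how to make the request with XMLHTTP and then pass the response to MSHTML for parsing.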
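As a rough illustration of that idea (not the code from the linked article), here is a minimal sketch that fetches a page with XMLHTTP and loads the response text into a late-bound MSHTML document. The function name `FetchHtmlDocument` is made up for this example, and it assumes a plain GET returns the page you need and that the site accepts your current Windows credentials.

```vba
Option Explicit

' Hypothetical helper: downloads sUrl and returns an MSHTML document
' built from the response text. No Internet Explorer window is involved.
Function FetchHtmlDocument(ByVal sUrl As String) As Object
    Dim oXhr As Object
    Dim oDoc As Object

    ' Synchronous request (third argument False = wait for the response)
    Set oXhr = CreateObject("MSXML2.XMLHTTP.6.0")
    oXhr.Open "GET", sUrl, False
    oXhr.send

    If oXhr.Status <> 200 Then
        Err.Raise vbObjectError + 513, "FetchHtmlDocument", _
                  "HTTP " & oXhr.Status & " for " & sUrl
    End If

    ' "htmlfile" is the late-bound ProgID for an MSHTML document
    Set oDoc = CreateObject("htmlfile")
    oDoc.write oXhr.responseText
    oDoc.Close

    Set FetchHtmlDocument = oDoc
End Function
```

The returned object exposes the same `all`, `getElementById` and table members the question's code already uses, so something like `GetOneTable FetchHtmlDocument("http://intranetsite.aspx"), 9, 53` should work for a page that needs no form submission first.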
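You can use MSHTML independently of IE; see the article "Use MSHTML to parse local HTML file without using Internet Explorer (Microsoft HTML Object Library)". If you read it, you will see that most code written against the IE object model is really written against the MSHTML object model, so you can decouple the two and ditch IE. Enjoy!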
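To make that concrete, here is a small sketch that parses an HTML string with MSHTML alone. The sub name and the sample markup are invented for the demo; the document members it calls are the same ones IE-based code reaches through `objIE.Document`.

```vba
Option Explicit

' Demo only: parses a literal HTML string with MSHTML and prints each
' table cell to the Immediate window. No browser is started.
Sub DemoParseWithoutIE()
    Dim oDoc As Object
    Dim oCell As Object

    Set oDoc = CreateObject("htmlfile")
    oDoc.write "<table><tr><td>Acct</td><td>Balance</td></tr></table>"
    oDoc.Close

    ' Same DOM calls you would make against objIE.Document
    For Each oCell In oDoc.getElementsByTagName("td")
        Debug.Print oCell.innerText
    Next oCell
End Sub
```

Because `GetOneTable` in the question only touches the document it is given, it should accept a document built this way without changes, for example `GetOneTable oDoc, 1, 53`.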
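EDIT1: Don't forget that you can ask your company's IT staff.
You say it is an intranet site, which implies it is internal to your company, so you can ask the programmers responsible for that system for guidance on a direct API.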
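EDIT2: Folding in feedback about how to mimic a browser...
To mimic a browser, you need to work out what traffic a button click generates.
To watch network traffic, I suggest switching to Chrome as your browser. Then, on this very web page, right-click and choose the "Inspect" menu option, which opens Chrome Developer Tools. In Developer Tools, select the "Network" tab, then click a link on this page and you will see the traffic it generates.
So if you want to use pure XMLHTTP and leave the browser behind, you won't be able to click a button, but you can observe the network traffic that occurs when the button is clicked in a browser and then emulate that traffic in code.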
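For example, in your comment you ask how to enter the account number and click the button. I'm guessing that clicking the button results in an HTTP call to something like http://example.com/dowork/mypage.asp?accountnumber=1233456&otherParams=true, so you would see the account number tucked into the query parameters. Once you have that URL, you can put it into an XMLHTTP request.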
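A sketch of what that request could look like follows. The base URL and the parameter names `accountnumber` and `otherParams` are only the guesses from the example URL above, so substitute whatever the Network tab actually shows; `GetReportHtml` is likewise a made-up name.

```vba
Option Explicit

' Hypothetical: requests the report page with the account number passed
' as a query parameter and returns the raw HTML of the response.
Function GetReportHtml(ByVal sAccountNum As String) As String
    ' Placeholder taken from the example URL; replace with the real one
    Const BASE_URL As String = "http://example.com/dowork/mypage.asp"
    Dim oXhr As Object
    Dim sUrl As String

    ' Same cleanup the question's code does: drop dashes and spaces
    sAccountNum = Replace(Replace(sAccountNum, "-", ""), " ", "")

    sUrl = BASE_URL & "?accountnumber=" & sAccountNum & "&otherParams=true"

    Set oXhr = CreateObject("MSXML2.XMLHTTP.6.0")
    oXhr.Open "GET", sUrl, False   ' False = synchronous
    oXhr.send

    If oXhr.Status <> 200 Then
        Err.Raise vbObjectError + 513, "GetReportHtml", _
                  "HTTP " & oXhr.Status & " for " & sUrl
    End If

    GetReportHtml = oXhr.responseText
End Function
```

The account number is concatenated without URL encoding, which is only safe because it is reduced to plain digits first; any other parameter value would need encoding.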
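One potential problem is that the system's designers may have chosen to hide the account number in the body of an HTTP POST because it is sensitive/confidential data. Chrome Developer Tools is very good and should still show you this information, but you may have to dig around for it.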
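If it does turn out to be a POST, a sketch along these lines may be a starting point. It rests on several assumptions: the page is an ASP.NET Web Forms page (the `.aspx` URL and `ctl00_` ids in the question suggest so), which normally expects hidden fields such as `__VIEWSTATE` to be posted back; the button is an ordinary submit input; and all values are ASCII. All helper names are hypothetical, and the real field list should be copied from the request body shown in the Network tab.

```vba
Option Explicit

' Percent-encodes a string for a form body. ASCII-only by assumption.
Private Function UrlEncodeAscii(ByVal sText As String) As String
    Dim i As Long, sChar As String, sOut As String
    For i = 1 To Len(sText)
        sChar = Mid$(sText, i, 1)
        If sChar Like "[A-Za-z0-9._~-]" Then
            sOut = sOut & sChar
        Else
            sOut = sOut & "%" & Right$("0" & Hex$(Asc(sChar)), 2)
        End If
    Next i
    UrlEncodeAscii = sOut
End Function

' Returns "&name=value" for the field with the given id, optionally
' overriding its value. Returns "" when the field is not on the page.
Private Function FieldPair(ByVal oDoc As Object, ByVal sId As String, _
                           Optional ByVal vValue As Variant) As String
    Dim oField As Object
    Set oField = oDoc.getElementById(sId)
    If oField Is Nothing Then Exit Function
    If IsMissing(vValue) Then vValue = oField.Value
    FieldPair = "&" & UrlEncodeAscii(oField.Name) & _
                "=" & UrlEncodeAscii(CStr(vValue))
End Function

' GETs the form page, then POSTs it back with the account number filled in.
Function PostAccountReport(ByVal sUrl As String, _
                           ByVal sAccountNum As String) As String
    Dim oXhr As Object, oDoc As Object, sBody As String

    ' Step 1: fetch the form so the hidden state fields can be read
    Set oXhr = CreateObject("MSXML2.XMLHTTP.6.0")
    oXhr.Open "GET", sUrl, False
    oXhr.send
    Set oDoc = CreateObject("htmlfile")
    oDoc.write oXhr.responseText
    oDoc.Close

    ' Step 2: build the form body (ids taken from the question's code)
    sBody = FieldPair(oDoc, "__VIEWSTATE") & _
            FieldPair(oDoc, "__VIEWSTATEGENERATOR") & _
            FieldPair(oDoc, "__EVENTVALIDATION") & _
            FieldPair(oDoc, "ctl00_MainContentRegion_tAcct", sAccountNum) & _
            FieldPair(oDoc, "ctl00_MainContentRegion_btnRunReport")
    sBody = Mid$(sBody, 2)   ' drop the leading "&"

    ' Step 3: post it back as an ordinary form submission
    Set oXhr = CreateObject("MSXML2.XMLHTTP.6.0")
    oXhr.Open "POST", sUrl, False
    oXhr.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
    oXhr.send sBody

    PostAccountReport = oXhr.responseText
End Function
```

The returned HTML can then be written into an MSHTML document and handed to the existing table parser. If the response looks wrong, compare `sBody` field by field with the body Chrome shows for the real button click.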