VBA Excel: extract new web page data after clicking "Submit"

Posted: 2014-07-11 15:18:50

Tags: html excel vba web-scraping web-crawler

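I am trying to extract some information from a website that serves oil well data by API number (the API number is a unique identifier for every well in the US).

Website: http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1

Example API number: 1708300502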

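The problem is that when I reach the second page, IE.document.getElementsByTagName("body")(0).innerText still returns data from the initial page. How do I get the updated page data?

The end goal is to get to page 2, click "30570" via IE.document.getElementsByTagName("a")(0).Click, and then read the final third page. I just can't figure out how to read the updated page :(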

Option Explicit

Sub sonris_WellData()
   Dim IE As InternetExplorer
   Set IE = CreateObject("InternetExplorer.Application")
   IE.Visible = True

   Dim i As Integer

   'Open SONRIS website
   Application.StatusBar = "Opening Website"
   IE.navigate "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
   Do While IE.readyState <> 4: DoEvents: Loop
   Application.Wait Now() + TimeValue("00:00:01")
   Application.StatusBar = False

   IE.document.forms(0).p_apinum.Value = "1708300502" 'plug-in API
   IE.document.forms(0).submit

   ' Wait until the next page opens
   Application.StatusBar = "Opening Website"
   Do While IE.readyState <> 4: DoEvents: Loop
   Application.Wait Now() + TimeValue("00:00:01")
   Application.StatusBar = False

   ' Read the page - this is where the issue occurs, MsgBox keeps returning text from the very 1st page
   MsgBox IE.document.getElementsByTagName("body")(0).innerText

   IE.Quit
End Sub

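2 answers:

Answer 0 (score: 2)

This seems to work. Instead of DoEvents, use the WinAPI Sleep function. I also added a call to Sleep after the form is submitted.

We often see sites that are served dynamically by some JavaScript or similar. In those cases the browser may report READYSTATE_COMPLETE, or report that it is not Busy, even though the page has not yet rendered the "new" results.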

Option Explicit
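'64-bit Office (VBA7) requires PtrSafe on API declarations
#If VBA7 Then
Public Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
Public Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If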
Sub sonris_WellData()
   Dim IE As Object 'InternetExplorer
   Set IE = CreateObject("InternetExplorer.Application")
   IE.Visible = True

   Dim i As Integer

   'Open SONRIS website
   Application.StatusBar = "Opening Website"
   IE.navigate "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
   Do While IE.readyState <> 4
       Sleep 1000
   Loop

   Application.StatusBar = False

   IE.document.forms(0).p_apinum.Value = "1708300502" 'plug-in API
   IE.document.forms(0).submit

   Sleep 1000

   ' Wait until the next page opens
   Application.StatusBar = "Opening Website"
   Do While IE.readyState <> 4
    Sleep 1000
   Loop

   Application.StatusBar = False

   ' Read the page - with the Sleep calls in place this should now return text from the 2nd page
   MsgBox IE.document.getElementsByTagName("body")(0).innerText

   IE.Quit
End Sub

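You may want to experiment with a slightly longer Sleep after the .submit.

Alternatively, I noticed that the URL changes after you submit, so you could also try changing the second wait loop to: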

Do While IE.LocationURL = "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
    Sleep 1000
Loop

This should make Excel.Application wait until the URL changes.
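A minimal sketch of the same URL-change wait with a timeout added, so the loop cannot spin forever if the submit never navigates. WaitForNewPage and its parameters are hypothetical names, not part of the answer above, and the 30-second default is an arbitrary assumption.

```vba
Option Explicit

' If the module already declares Sleep (as in the answer above), drop this block.
#If VBA7 Then
    Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
    Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If

' Hypothetical helper: waits until the browser has left oldUrl and finished loading,
' or until timeoutSeconds elapse. Returns True only if the new page loaded in time.
Public Function WaitForNewPage(ByVal IE As Object, ByVal oldUrl As String, _
                               Optional ByVal timeoutSeconds As Double = 30) As Boolean
    Dim deadline As Date
    deadline = Now + timeoutSeconds / 86400#   ' 86400 seconds per day

    Do While Now < deadline
        If IE.LocationURL <> oldUrl And IE.readyState = 4 And Not IE.Busy Then
            WaitForNewPage = True
            Exit Function
        End If
        Sleep 250   ' poll four times a second
    Loop
    ' Falls through with the default return value False on timeout
End Function
```

It would be called right after IE.document.forms(0).submit, passing the landing page URL as oldUrl, and the caller can skip reading the body when it returns False.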

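Alternatively, you may have better luck using XMLHTTPRequest (there are plenty of examples on SO and elsewhere). This lets you send a request the way a browser would, without actually driving a web browser. You can then simply parse the returned text as HTML or XML. I would use the Microsoft XML, v6.0 library reference.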
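A rough sketch of that XMLHTTPRequest approach, late bound so it runs without setting the library reference. FetchWellPage is a hypothetical name, and the cart_con_wellapi2 endpoint and p_apinum form field are taken from the other answer in this thread rather than verified here.

```vba
Option Explicit

' Hypothetical helper: posts the API number straight to the results page and
' returns the response HTML as text, or an empty string on a non-200 status.
Public Function FetchWellPage(ByVal apiNumber As String) As String
    ' Assumed endpoint and field name (see the lead-in); adjust if the site differs
    Const TARGET_URL As String = "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi2"

    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")

    http.Open "POST", TARGET_URL, False   ' False = synchronous, so send blocks until done
    http.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
    http.send "p_apinum=" & apiNumber

    If http.Status = 200 Then FetchWellPage = http.responseText
End Function
```

Because the request is synchronous there is no readyState polling at all: the text returned by FetchWellPage("1708300502") is the response to the POST, never the landing page.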

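Answer 1 (score: 1)

POST request:

① Enter the Well API number

I inspected the web page you chose. I used Fiddler to examine the network traffic and noticed that when you submit the API number, the initial request is handled by a POST request.

[Image: POST request]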


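② POST request:

The POST body has the following parameters:

[Image: POST param]

p_apinum is the key, and the associated value is the original Well API number.

Using this information, I formulated the POST request directly, which bypasses your first landing page.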


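③ Pressing the hyperlink:

Next, I looked at the element you want to press:

[Image: Hyperlink]

Looking at the associated HTML, it has a relative hyperlink:

[Image: HTML]

I use a helper function to parse the page HTML, get this relative link, and build the absolute path: GetNextURL(page.body.innerHTML)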
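To illustrate the string parsing that helper relies on, here is a sketch of a more defensive variant. ExtractFirstHref is a hypothetical name, and the sample markup is made up for illustration, not copied from the site. Like the original, it assumes href is the last attribute in the tag.

```vba
Option Explicit

' Hypothetical variant of GetNextURL: returns the first href value in html,
' or an empty string when the markup contains no href attribute.
Public Function ExtractFirstHref(ByVal html As String) As String
    If InStr(1, html, "href=", vbTextCompare) = 0 Then Exit Function

    Dim afterHref As String
    afterHref = Split(html, "href=", 2, vbTextCompare)(1)    ' text following the first href=
    afterHref = Split(afterHref, ">")(0)                      ' cut at the end of the opening tag
    afterHref = Replace$(afterHref, Chr$(34), vbNullString)   ' drop the surrounding double quotes
    ' A relative link may come back prefixed with about: once parsed; strip it
    ExtractFirstHref = Replace$(afterHref, "about:", vbNullString)
End Function

Public Sub DemoExtractFirstHref()
    ' Made-up markup for illustration only
    Debug.Print ExtractFirstHref("<td><a href=""next_page?id=30570"">30570</a></td>")
    ' prints next_page?id=30570
End Sub
```

The only behavioural difference from the one-line version is the InStr guard, which avoids a "Subscript out of range" error when the response contains no link.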


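④ Making a new request:

I re-use my HTTP request function GetPage to send a second request with an empty body, and grab all the tables from the returned HTML document via page.getElementsByTagName("table").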


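⑤ Writing the tables to the Excel worksheet:

I loop over all the tables on the page, using the helper AddHeaders to write out the table headers and WriteTables to write the current table to the sheet.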


Example page content:

[Image: Example page content]


Example code output:

[Image: Code output]


VBA:

Option Explicit
Public Sub GetWellInfo()
    Dim ws As Worksheet, page As HTMLDocument, targetTable As HTMLTable, apiNumbers(), currNumber As Long
    Const PARAM1 As String = "p_apinum"
    Const BASESTRING As String = "http://sonlite.dnr.state.la.us/sundown/cart_prod/"
    apiNumbers = Array(1708300502, 1708300503)

    Application.ScreenUpdating = False
    Set ws = ThisWorkbook.Worksheets("Sheet1")

    With ws
        .Cells.ClearContents
        For currNumber = LBound(apiNumbers) To UBound(apiNumbers)
            Set page = GetPage(BASESTRING & "cart_con_wellapi2", apiNumbers(currNumber), PARAM1)
            Set page = GetPage(BASESTRING & GetNextURL(page.body.innerHTML))
            Dim allTables As Object
            Set allTables = page.getElementsByTagName("table")

            For Each targetTable In allTables
                AddHeaders targetTable, GetLastRow(ws, 1) + 2, ws
                WriteTables targetTable, GetLastRow(ws, 1), ws
            Next targetTable

        Next currNumber
    End With
    Application.ScreenUpdating = True
End Sub

Public Function GetPage(ByVal url As String, Optional ByVal apiNumber As Long, Optional ByVal paramN As String = vbNullString) As HTMLDocument
    Dim objHTTP As Object, html As New HTMLDocument

    Set objHTTP = CreateObject("WinHttp.WinHttpRequest.5.1")

    Dim sBody As String
    If Not paramN = vbNullString Then sBody = paramN & "=" & apiNumber
    With objHTTP
        .SetTimeouts 10000, 10000, 10000, 10000
        .Open "POST", url, False
        .setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
        .setRequestHeader "Content-type", "application/x-www-form-urlencoded"
        On Error Resume Next
        .send (sBody)
        If Err.Number = 0 Then
            If .Status = 200 Then   'Status is numeric, so compare against a number
                html.body.innerHTML = .responseText
                Set GetPage = html
            Else
                Debug.Print "HTTP " & .Status & " " & .statusText
                Exit Function
            End If
        Else
            Debug.Print "Error " & Err.Number & " " & Err.Source & " " & Err.Description
            Exit Function
        End If
        On Error GoTo 0
    End With

End Function

Public Function GetNextURL(ByVal inputString As String) As String
    GetNextURL = Replace$(Replace$(Split(Split(inputString, "href=")(1), ">")(0), Chr$(34), vbNullString), "about:", vbNullString)
End Function

Public Sub AddHeaders(ByVal hTable As Object, ByVal startRow As Long, ByVal ws As Worksheet)
    Dim headers As Object, header As Object, columnCounter As Long
    Set headers = hTable.getElementsByTagName("th")
    For Each header In headers
        columnCounter = columnCounter + 1
        ws.Cells(startRow, columnCounter) = header.innerText
    Next header
End Sub

Public Sub WriteTables(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByRef ws As Worksheet)
    If ws Is Nothing Then Set ws = ActiveSheet

    Dim tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, c As Long
    r = startRow: c = 1   'Cells is 1-based, so the column counter must start at 1
    With ws               'write to the sheet that was passed in, not whichever sheet is active
        Set tRow = hTable.getElementsByTagName("tr")
        For Each tr In tRow
            Set tCell = tr.getElementsByTagName("td")
            For Each td In tCell
                .Cells(r, c).Value = td.innerText
                c = c + 1
            Next td
            r = r + 1:  c = 1
        Next tr
    End With
End Sub

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
    End With
End Function

References:

VBE > Tools > References > Microsoft HTML Object Library.