How to extract listed data from multiple web pages - cannot find the table tag

Time: 2017-07-26 08:45:39

Tags: excel vb.net web web-scraping webpage

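First, I have read various answers on this topic online, but I have to admit I am really struggling to adapt them to what I need, so any help would be greatly appreciated!

I need to extract the data listed on the following web page (pages 1-7), i.e. fund name, price, currency, etc.: https://toolkit.financialexpress.net/santanderam, and export this data to Excel.

I have the following code, which opens the page in IE (this part works):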

' return the document containing the DOM of the page strWebAddress
' returns Nothing if the timeout lngTimeoutInSeconds was reached
Public Function GetIEDocument(ByVal strWebAddress As String, Optional ByVal lngTimeoutInSeconds As Long = 15) As MSHTML.HTMLDocument
Dim IE As SHDocVw.InternetExplorer
Dim IEDocument As MSHTML.HTMLDocument
Dim dateNow As Date

' create an IE application, representing a tab
Set IE = New SHDocVw.InternetExplorer

' optionally make the application visible, though it will work perfectly fine in the background otherwise
IE.Visible = True

' open a webpage in the tab represented by IE and wait until the main request successfully finished
' times out after lngTimeoutInSeconds with a warning
IE.Navigate strWebAddress
dateNow = Now
Do While IE.Busy
    If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
Loop

' retrieve the webpage's content (that is, the HTML DOM) and wait until everything is loaded (images, etc.)
' times out after lngTimeoutInSeconds with a warning
Set IEDocument = IE.Document
dateNow = Now
Do While IEDocument.ReadyState <> "complete"
    If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
Loop

Set GetIEDocument = IEDocument
End Function

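However, I cannot find the table tag that contains all the other tags I am interested in, which the rest of the code needs in order to extract the data. Here is what I have so far: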

Public Sub GetTeamData()
Dim strWebAddress As String
Dim strH2AnchorContent As String
Dim IEDocument As MSHTML.HTMLDocument
Dim objH2 As MSHTML.HTMLHeaderElement
Dim objTable As MSHTML.HTMLTable
Dim objRow As MSHTML.HTMLTableRow
Dim objCell As MSHTML.HTMLTableCell
Dim lngRow As Long
Dim lngColumn As Long

' initialize some variables that should probably better be passed as parameters or defined as constants
strWebAddress = "https://toolkit.financialexpress.net/santanderam"
strH2AnchorContent = "   "

' open page
Set IEDocument = GetIEDocument(strWebAddress)
If IEDocument Is Nothing Then
    MsgBox "Timeout reached opening this address:" & vbNewLine & strWebAddress, vbCritical
    Exit Sub
End If

' retrieve anchor element
For Each objH2 In IEDocument.getElementsByTagName("h2")
   If objH2.innerText = strH2AnchorContent Then Exit For
Next objH2
If objH2 Is Nothing Then
    MsgBox "Could not find """ & strH2AnchorContent & """ in DOM!", vbCritical
    Exit Sub
End If

' traverse HTML tree to desired table element
' * move up one element in the hierarchy
' * skip two elements to proceed to the third (interjected each time with whitespace that is interpreted as an element of its own)
' * move down two elements in the hierarchy
Set objTable = objH2.parentElement _
                    .NextSibling.NextSibling _
                    .NextSibling.NextSibling _
                    .NextSibling.NextSibling _
                    .Children(0) _
                    .Children(0)

' iterate over the table and output its contents
lngRow = 1
For Each objRow In objTable.Rows
    lngColumn = 1
    For Each objCell In objRow.Cells
        Cells(lngRow, lngColumn) = objCell.innerText
        lngColumn = lngColumn + 1
    Next objCell
    lngRow = lngRow + 1
Next objRow
End Sub

I assume I can locate the correct table tag via the following line:

 strH2AnchorContent = "  "

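Would the above work? If so, can anyone help me find the correct tag, or point out what I am doing wrong above?

Thanks again for any help!

Thanks

Edit 1

Updated code: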

' open a webpage in the tab represented by IE and wait until the main request successfully finished
' times out after lngTimeoutInSeconds with a warning
IE.Navigate strWebAddress
dateNow = Now
Do While IE.Busy
    If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
Loop

' retrieve the webpage's content (that is, the HTML DOM) and wait until everything is loaded (images, etc.)
' times out after lngTimeoutInSeconds with a warning
Set IEDocument = IE.Document
dateNow = Now
Do While IEDocument.ReadyState <> "complete"
    If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
Loop

Set GetIEDocument = IEDocument
End Function

Public Sub GetTeamData()
Dim strWebAddress As String
Dim strH2AnchorContent As String
Dim IEDocument As MSHTML.HTMLDocument
Dim objH2 As MSHTML.HTMLHeaderElement
Dim oTable As MSHTML.HTMLTable
Dim objRow As MSHTML.HTMLTableRow
Dim objCell As MSHTML.HTMLTableCell
Dim lngRow As Long
Dim lngColumn As Long

' initialize some variables that should probably better be passed as parameters or defined as constants
strWebAddress = "https://toolkit.financialexpress.net/santanderam"


' open page
Set IEDocument = GetIEDocument(strWebAddress)
If IEDocument Is Nothing Then
    MsgBox "Timeout reached opening this address:" & vbNewLine & strWebAddress, vbCritical
    Exit Sub
End If

' retrieve the table element by its id
Set oTable = IEDocument.getElementById("Price_1_1")
Debug.Print oTable.innerText

' iterate over the table and output its contents
lngRow = 1
For Each objRow In oTable.Rows
    lngColumn = 1
    For Each objCell In objRow.Cells
        Cells(lngRow, lngColumn) = objCell.innerText
        lngColumn = lngColumn + 1
    Next objCell
    lngRow = lngRow + 1
Next
End Sub

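1 answer:

Answer 0 (score: 0)

Your code works fine; the problem is that you are trying to capture data from the table before the table has loaded. I added a simple wait loop of 5 seconds, and your current code then captured the data. Here is the loop I added just before the Set oTable = IEDocument.getElementById("Price_1_1") statement: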

dateNow = Now
bExitLoop = False
lngTimeoutInSeconds = 5
Do While Not bExitLoop
    If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Do
Loop

The code above is a static 5-second wait. You could make it more dynamic... I'll leave that as a brain teaser :)
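
One possible way to make the wait dynamic is to poll the DOM until the table element appears, instead of always sleeping for a fixed 5 seconds. The sketch below is only an illustration: WaitForElementById is a hypothetical helper name that does not appear in the original post, and it assumes that the element with id "Price_1_1" being present in the DOM means the table has been loaded. If the page inserts the table first and fills in its rows later, an additional check on the row count would be needed.

```vba
' Hypothetical helper (not part of the original post).
' Polls the document until an element with the given id exists,
' or until lngTimeoutInSeconds has elapsed.
' Returns Nothing if the element did not appear in time.
Public Function WaitForElementById(ByVal IEDocument As MSHTML.HTMLDocument, _
                                   ByVal strId As String, _
                                   Optional ByVal lngTimeoutInSeconds As Long = 15) As Object
    Dim dateStart As Date
    Dim objElement As Object

    dateStart = Now
    Do
        ' getElementById returns Nothing while the element is not yet in the DOM
        Set objElement = IEDocument.getElementById(strId)
        If Not objElement Is Nothing Then Exit Do

        ' give up once the timeout has been reached
        If Now > DateAdd("s", lngTimeoutInSeconds, dateStart) Then Exit Do

        ' yield so that Excel stays responsive while waiting
        DoEvents
    Loop

    Set WaitForElementById = objElement
End Function
```

In GetTeamData, the static loop and the Set oTable = IEDocument.getElementById("Price_1_1") statement could then be replaced by Set oTable = WaitForElementById(IEDocument, "Price_1_1"), followed by an If oTable Is Nothing Then check that reports the timeout and exits the Sub.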