从Morningstar中提取特定的表格单元格,然后循环到下一个Morningstar页面

时间:2019-07-01 16:33:27

标签: excel vba web-scraping queryselector

我目前正在尝试从Morningstar的一张表中抓取某些数据,然后将其循环到下一个代码,并重复进行直到没有更多的代码为止。

当前,它将拉动“追踪总回报”表上的整个“类别内排名”行。我只是想拉3个月,6个月,年初至今,1年,3年和5年。完成这些操作后,它将循环到导航行中“ Cells(p,14)”所确定的下一个股票行情。

即。它检测到“ LINKX”位于第1、14单元格中,因此导航到http://performance.morningstar.com/fund/performance-return.action?t=LINKX&region=usa&culture=en_US,并从“跟踪总收益”表中拉出所有“类别等级”行。我只希望将指定的那些放入指定的单元格位置,然后循环到下一个自动收录器。

我已经使用excel VBA浏览了许多这些线程,我试图从某个行情指示器页面中提取关键的特定信息,然后循环到下一个行情指示器并重复。

Declare PtrSafe Function apiShowWindow Lib "user32" Alias "ShowWindow" _
        (ByVal hwnd As LongPtr, ByVal nCmdShow As LongPtr) As LongPtr
    Global Const SW_MAXIMIZE = 3
    Global Const SW_SHOWNORMAL = 1
    Global Const SW_SHOWMINIMIZED = 2

Sub LinkedInWebScrapeScript()

    Dim objIE As InternetExplorer

    Dim html As HTMLDocument

    Set objIE = GetObject("new:{D5E8041D-920F-45e9-B8FB-B1DEB82C6E5E}")
    objIE.Visible = 1
Dim p As Integer
p = 3

    objIE.navigate ("http://performance.morningstar.com/fund/performance-return.action?t=" & Cells(p, 14) & "&region=usa&culture=en_US")
    Application.Wait Now + #12:00:02 AM#

    While objIE.Busy
        DoEvents
    Wend
    apiShowWindow objIE.hwnd, SW_MAXIMIZE

    For i = 1 To 2
        objIE.document.parentWindow.scrollBy 0, 100000 & i
        Application.Wait Now + #12:00:01 AM#
    Next i

Dim TDelements As IHTMLElementCollection
Dim htmldoc As MSHTML.IHTMLDocument 'Document object
Dim eleColtr As MSHTML.IHTMLElementCollection 'Element collection for tr tags
Dim eleColtd As MSHTML.IHTMLElementCollection 'Element collection for td tags
Dim eleColtd1 As MSHTML.IHTMLElementCollection
Dim eleRow As MSHTML.IHTMLElement 'Row elements
Dim eleCol As MSHTML.IHTMLElement 'Column elements
Set htmldoc = objIE.document 'Document webpage
Set eleColtr = htmldoc.getElementsByTagName("tr") 'Find all tr tags
Set TDelements = htmldoc.getElementsByTagName("table")
'This section populates Excel
i = 0 'start with first value in tr collection


Set eleColtd = htmldoc.getElementsByClassName("r_table3 width955px print97")(0).getElementsByClassName("last")(0).getElementsByClassName("row_data divide") 'get all the td elements in that specific tr

    For Each eleCol In eleColtd 'for each element in the td collection
        Sheets("Sheet2").Range("A1").Offset(i, j).Value = eleCol.innerText 'paste the inner text of the td element, and offset at the same time
        j = j + 1 'move to next element in td collection
    Next eleCol 'rinse and repeat
i = i + 1

p = p + 1

objIE.navigate ("http://performance.morningstar.com/fund/performance-return.action?t=" & Cells(p, 14) & "&region=usa&culture=en_US")

Set eleColtd = htmldoc.getElementsByClassName("r_table3 width955px print97")(0).getElementsByClassName("last")(0).getElementsByClassName("row_data divide") 'get all the td elements in that specific tr

    For Each eleCol In eleColtd 'for each element in the td collection
        Sheets("Sheet2").Range("A1").Offset(i, j).Value = eleCol.innerText 'paste the inner text of the td element, and offset at the same time
        z = z + 1
        j = j + 1 'move to next element in td collection
    Next eleCol 'rinse and repeat


End Sub

这将拉动“追踪总回报”表上的整个“类别中的排名”行。我只是想拉3个月,6个月,年初至今,1年,3年和5年。完成这些操作后,它将循环到导航行中“ Cells(p,14)”所确定的下一个股票行情。

1 个答案:

答案 0 :(得分:2)

以下内容显示了一个循环以及如何选择适当的表,然后使用css selectors来填充表单元格。代码从第1行的第N列读取到一个数组中。它假设范围内没有空白单元格(尽管可以确定要添加一个测试)。

在数组上有一个循环,其中包含每个报价器,URL中的TICKER占位符将替换为当前报价器值。

在“每月显示”标签上单击一行。

通过以下行标识适当的行

Set rankings = .querySelectorAll("#tab-month-end-content .last td")

#tab-month-end-content是一个ID选择器,它会在右侧显示标签,然后.last是最后一个tbody(即last)的类名的类选择器,然后td用于指定该tbody中的子td单元。


CSS选择器:

现代浏览器针对CSS进行了优化。 CSS选择器是匹配html文档中元素的一种快速方法。 CSS选择器是通过querySelector或querySelectorAll方法应用的;在这种情况下,为HTMLDocument(即文档)。 querySelector返回一个节点:css选择器的第一个匹配项; querySelectorAll返回所有匹配项的nodeList-然后索引该nodeList以获取特定项,例如第二个td单元格位于索引1。

查看我们指定的模式:

#tab-month-end-content .last td

第一部分是id selector#,它通过ID选择一个元素

#tab-month-end-content

应用于页面时,这将返回两个匹配项,我们需要第二个匹配项

点击图片放大

enter image description here

下一部分

.last 

是类名.的{​​{3}},last。这将选择上图所示的tbody标签子元素。由于只有第二个id匹配的元素有这个孩子,所以我们现在使用正确的父元素继续工作,并使用class selector选择td类型的元素

td

上述每个部分之间的空白被称为type selector,如果它们的祖先元素与第一个选择器匹配,则它们指定选择第二个选择器匹配的元素,即左边的选择器是右边的相邻css选择器检索到的与选择器匹配的元素的父级。

我们可以在下一张图片中看到它:

点击图片放大

descendant combinators


VBA:

Option Explicit
Public Sub GetData()
    Dim ie As Object, tickers(), ws As Worksheet, lastRow As Long
    Dim results(), headers(), r As Long, i As Long, url As String

    headers = Array("ticker", "3m", "6m", "ytd", "1y", "3y", "6y")
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    tickers = Application.Transpose(ws.Range("N1:N" & GetLastRow(ws, 14)).Value)
    ReDim results(1 To UBound(tickers), 1 To UBound(headers) + 1)
    Set ie = GetObject("new:{D5E8041D-920F-45e9-B8FB-B1DEB82C6E5E}")
    With ie
        .Visible = True
        For i = LBound(tickers) To UBound(tickers)
            r = r + 1
            url = Replace$("http://performance.morningstar.com/fund/performance-return.action?t=TICKER&region=usa&culture=en_US", "TICKER", tickers(i))
            .Navigate2 url

            While .Busy Or .readyState < 4: DoEvents: Wend

            .document.querySelector("[tabname='#tabmonth']").Click

            Dim rankings As Object
            Do
            Loop While .document.querySelectorAll("#tab-month-end-content .last td").Length = 0 'could add timed loop here

            With .document
                Set rankings = .querySelectorAll("#tab-month-end-content .last td")
                On Error Resume Next
                results(r, 1) = tickers(i)
                results(r, 2) = rankings.item(1).innerText
                results(r, 3) = rankings.item(2).innerText
                results(r, 4) = rankings.item(3).innerText
                results(r, 5) = rankings.item(4).innerText
                results(r, 6) = rankings.item(5).innerText
                results(r, 7) = rankings.item(6).innerText
                On Error GoTo 0
            End With
            Set rankings = Nothing
        Next
        ws.Cells(1, 15).Resize(UBound(results, 1), UBound(results, 2)) = results
        .Quit
    End With
End Sub

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
    End With
End Function

如@SIM所述,您可以使用enter image description here并避免使用浏览器,尽管不确定您的安全设置是否需要将网站列入白名单。您需要在以下网址中查询占位符是否有效:XNAS:TICKERXNAS前缀可能会随您的代码而有所不同,在这种情况下,您将需要适当的字符串(包括N列中的前缀),然后将扩展名占位符替换为例如。 ..... =PLACEHOLDER&region ..

Option Explicit
Public Sub GetData()
    Dim tickers(), ws As Worksheet, lastRow As Long
    Dim results(), headers(), r As Long, i As Long, url As String, html As HTMLDocument
    Set html = New HTMLDocument 'vbe > tools > references > Microsoft HTML Object Library

    headers = Array("ticker", "3m", "6m", "ytd", "1y", "3y", "6y")
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    tickers = Application.Transpose(ws.Range("N1:N" & GetLastRow(ws, 14)).Value)
    ReDim results(1 To UBound(tickers), 1 To UBound(headers) + 1)

    With CreateObject("MSXML2.XMLHTTP")

        For i = LBound(tickers) To UBound(tickers)
            r = r + 1
            url = Replace$("http://performance.morningstar.com/perform/Performance/fund/trailing-total-returns.action?&t=XNAS:TICKER&region=usa&culture=en-US&cur=&ops=clear&s=0P0000J533&ndec=2&ep=true&align=m&annlz=true&comparisonRemove=false&loccat=&taxadj=&benchmarkSecId=&benchmarktype=", "TICKER", tickers(i))
           .Open "GET", url, False
           .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
           .setRequestHeader "DNT", "1"
           .send
           html.body.innerHTML = .responseText

            Dim rankings As Object
            With html
                Set rankings = .querySelectorAll(".last td")

                On Error Resume Next
                results(r, 1) = tickers(i)
                results(r, 2) = rankings.item(1).innerText
                results(r, 3) = rankings.item(2).innerText
                results(r, 4) = rankings.item(3).innerText
                results(r, 5) = rankings.item(4).innerText
                results(r, 6) = rankings.item(5).innerText
                results(r, 7) = rankings.item(6).innerText
                On Error GoTo 0
            End With
            Set rankings = Nothing
        Next
        ws.Cells(1, 15).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
    End With
End Function