网页抓取ETF每日数据VBA

时间:2018-10-24 15:01:49

标签: excel vba excel-vba web-scraping yahoo-finance

我正在尝试通过网络收集一些不同ETF的每日信息。我发现https://www.marketwatch.com/有准确的信息。 最相关的信息是ETF的公开价格,流通股,资产净值,总资产。 这是IVV美国股票的链接:https://www.marketwatch.com/investing/fund/ivv

我以前用VBA抓取过Web,但是我使用的页面的HTML不同,我不知道这是否是因为ETF的某些值(例如价格和交易量)不断变化。 这个想法是创建一个代码以提取相关信息,并创建一个数据库以使用ETF作为国家,地区等之间的流动的市场指标来分析宏观经济因素。

第一种方法是使用VBA,但是在我了解了更多数据之后,我想使用Python(在我对它有了更多的了解之后)尝试每天自动进行网络抓取过程。

我愿意接受任何建议或任何其他有用的网站(我曾与Yahoo Finance和Morningstar进行过尝试,但HTML代码也遇到了同样的问题)。

这是我的不良代码:

Sub Get_Data()

    Dim ticker As String, enlace As String

    ticker = ThisWorkbook.Worksheets("ETFs").Cells(2, 2).Value 'IVV
    'link = "https://www.morningstar.com/etfs/arcx/" & ticker & "/quote.html"
    'link = "https://finance.yahoo.com/quote/" & ticker & "?p=" & ticker
    link = "https://www.marketwatch.com/investing/fund/" & ticker

    Application.ScreenUpdating = False

    Dim x As Integer
    x = ThisWorkbook.Worksheets("ETFs").Cells(Rows.Count, 1).End(xlUp).Row

    'Dim i As Integer
    'For i = 2 To x

    Dim total_net_assets As Variant, open_price As Variant, NAV As Variant, shares_out

    Set ie = CreateObject("InternetExplorer.application")
    With ie
        .Visible = False
        .navigate link
        While .Busy Or .readyState < 4: DoEvents: Wend
            Do
                DoEvents
                On Error Resume Next
                ' Here is where I get the problem of not knowing how to reference the values I need because the class name appears repeatedly
                total_net_assets = .document.getElementsByClassName("").Value
                open_price = .document.getElementByClassName("price").Value
                NAV = .document.getElementByClassName("").Value
                shares_out = .document.getElementByClassName("kv__value kv__primary ").Value
                On Error GoTo 0
            Loop
    End With
    ThisWorkbook.Worksheets("ETFs").Cells(2, 13).Value = total_net_assets
    ThisWorkbook.Worksheets("ETFs").Cells(2, 14).Value = NAV
    ThisWorkbook.Worksheets("ETFs").Cells(2, 15).Value = open_price
    ThisWorkbook.Worksheets("ETFs").Cells(2, 16).Value = shares_out
    ie.Quit
    'Next i
    Application.ScreenUpdating = True

End Sub

2 个答案:

答案 0 :(得分:3)

访问方式:

我使用XMLHTTP请求的速度比打开IE快得多。

代码注释:

以下内容从A2开始,从Sheet1列A中读取基金短代码到一个数组中。您可以轻松地扩展此操作,将更多资金添加到A列中。

此数组通过将资金代码串联到BASE_URL变量中来循环发出XMLHTTP请求。

我使用类clsHTTP来保持XMLHTTP对象高效-无需继续创建和销毁该对象。

我为此类提供了两种方法。一种用于检索目标页面innerHTML(GetString),另一种用于提取所需信息(如果可用)(GetInfo)。我使用字典来测试是否存在搜索到的标签。如果存在,我将获取相关的值。如果没有,我在字典中有一个占位符vbNullString

我将每个抓取的结果添加到名为results的集合中。最后,我将此写出循环到工作表。通过将大部分工作保留在内存中,可以加快抓取速度。


从HTML检索信息:

labels,例如Openvalues成对出现。

您可以通过使用nodeList方法来应用class CSS selector来生成getElementsByClassName(与querySelectorAll一样的思考集合),以通过其类名{{ 1}}。 kv__label是类选择器。

"."

您执行相同的操作以获取关联的值:

Set labels = .querySelectorAll(".kv__label") '<== nodeList of labels

您可以在Set values = .querySelectorAll(".kv__value.kv__primary") '<== nodeList of associated values. Same length as labels nodeList so can use same index to retrieve associated label/value pairs from each nodeList. 方法clsHTTP中使用字典循环标签,以查看是否搜索到标签,如果存在,则通过使用与索引相同的索引从值中检索关联的值在.GetInfo nodeList中找到标签的位置,并且该标签的字典labels值将使用实际检索到的值进行更新,否则将保留为vbNullString。 / p>


样本结果:

enter image description here

VBA:

类模块clsHTTP:

vbNullString

标准模块1:

Option Explicit
Private http As Object

Private Sub Class_Initialize()
    Set http = CreateObject("MSXML2.XMLHTTP")
End Sub

Public Function GetString(ByVal url As String) As String
    Dim sResponse As String
    With http
        .Open "GET", url, False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
        GetString = sResponse
    End With
End Function

Public Function GetInfo(ByVal html As HTMLDocument) As Object
    Dim dict As Object, i As Long
    Set dict = CreateObject("Scripting.Dictionary")
    dict.Add "Open", vbNullString
    dict.Add "Shares Outstanding", vbNullString
    dict.Add "Total Net Assets", vbNullString
    dict.Add "NAV", vbNullString

    Dim values As Object, labels As Object

    With html
        Set values = .querySelectorAll(".kv__value.kv__primary")
        Set labels = .querySelectorAll(".kv__label")

        For i = 0 To labels.Length - 1
            If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
        Next
    End With
    Set GetInfo = dict
End Function

设置:

enter image description here


不使用类:

Option Explicit   
Public Sub GetFundInfo()
    Dim sResponse As String, html As HTMLDocument, http As clsHTTP, i As Long
    Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
    Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"

    Application.ScreenUpdating = False

    headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
    Set results = New Collection
    Set http = New clsHTTP
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set html = New HTMLDocument

    funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your dotNums.

    For i = LBound(funds) To UBound(funds)
        If Not IsEmpty(funds(i)) Then
            url = BASE_URL & funds(i)
            html.body.innerHTML = http.GetString(url)
            results.Add http.GetInfo(html).Items
        End If
    Next

    If results.Count > 0 Then
        Dim item As Variant, r As Long, c As Long
        r = 2: c = 2
        With ws
            .Cells(1, c).Resize(1, UBound(headers) + 1) = headers
            For Each item In results
                .Cells(r, c).Resize(1, UBound(item) + 1) = item
                r = r + 1
            Next
        End With
    End If
    Application.ScreenUpdating = True
End Sub

答案 1 :(得分:1)

好的,因此您将需要创建两个循环。您可以继续为所需的每个价格点重复使用elem0elem1elemColl(1)变量-只需确保将每个新迭代的bFoundIt重置为False即可,不要提早退出For Loops。

对于您的total_net_assets变体,您首先循环kv__item的类。然后,您需要在kv__label的元素内循环kv__item的每个类集合,并在匹配innerText:总净资产时停止。匹配后,将使用第一个coll obj elem0为其获取kv__value kv__primary类名。

Dim IE As Object, elem0 As Object, elem1 As Object, i As Long, bFoundIt As Boolean

Set IE = CreateObject("InternetExplorer.application")
With IE
    .Visible = False
    .navigate link
    While .Busy Or .readyState < 4: DoEvents: Wend
        DoEvents
        bFoundIt = False
        For Each elem0 In .document.getElementsByClassName("kv__item")
            For Each elem1 In elem0.getElementsByClassName("kv__label")
                If elem1.innerText = "Total Net Assets" Then
                    bFoundIt = True
                    total_net_assets = elem0.getElementsByClassName("kv__value kv__primary ")(0).innerText
                    Exit For
                End If
            Next elem1
            If bFoundIt Then Exit For
        Next elem0