Scraping a web page into an Excel file

Time: 2018-11-06 05:28:41

Tags: excel vba excel-vba web-scraping

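For tomorrow's US election, I would like to be able to view live results in an Excel file and run some analysis on the vote counts for the House of Representatives. The numbers I want come from the Washington Post, which has a page for each of the 50 states at https://www.washingtonpost.com/election-results/STATE_NAME. The figures I am after are the number of votes each Democratic/Republican candidate received and the percentage of precincts reporting. These numbers would go into a spreadsheet laid out like this:

[image: target spreadsheet layout]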

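My guess is that the right way to get the data is to search for the district by its number (column B) to get the percentage reporting, and then use the candidate names to look up the vote counts. However, I don't know how to implement this in Excel so that it updates automatically, without copying and pasting the data.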

1 answer:

Answer 0 (score: 3)

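Assuming you have a subscription that lets you skip the subscription options page, you can use the following. I have left in a commented-out line where, during testing, I read the page HTML in from an HTML document saved on my desktop. That meant I did not risk hitting the free access limit and did not have to deal with the subscription options page.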

Option Explicit

Public Sub GetInfo()
    Dim IE As New InternetExplorer
    With IE
        .Visible = True
        .navigate "https://www.washingtonpost.com/election-results/florida/?noredirect=on&utm_term=.70d691c0a9e2"

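        ' Wait until the page has finished loading (readyState 4 = READYSTATE_COMPLETE).
        While .Busy Or .readyState < 4: DoEvents: Wend

        ' Do whatever is needed here to get past the subscription options page.
        ' This assumes you have a subscription and do not need to negotiate the limited free access.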

        Dim html As HTMLDocument
        'Set html = GetHTMLFileContent("C:\Users\User\Desktop\test.html") '<== During testing I read the HTML in from a file on the desktop here (helper not shown).

        Dim districts As Object, districtsCount As Long, arr(), hTable As HTMLTable, i As Long
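        ' Select the per-district result tables (.wpe-result) inside the results root.
        Set districts = .document.querySelectorAll(".tiling-results-wrapper #elections-code-root .wpe-result")
        'Set districts = html.querySelectorAll(".tiling-results-wrapper #elections-code-root .wpe-result") '<== Use this instead when reading the HTML from a saved file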
        districtsCount = districts.Length

        Dim tr As Object, td As Object, r As Long, c As Long, header As Long, headers()
        headers = Array("District#", "Candidates", "Votes", "Pct")
        ReDim arr(1 To 1000, 1 To 4)

        For i = 0 To districts.Length - 1

            Set hTable = districts.item(i)
            header = 1

            For Each tr In hTable.getElementsByTagName("tr")
                r = r + 1: c = 2
                ' Skip each table's own header row; its row in arr is left blank.
                If Not header = 1 Then
                    arr(r, 1) = "District " & i + 1
                    For Each td In tr.getElementsByTagName("td")
                        arr(r, c) = Replace$(td.innerText, "–", "-")
                        c = c + 1
                    Next
                End If
                header = header + 1
            Next
        Next

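        ' ReDim Preserve can only resize the last dimension, so transpose,
        ' trim to the r rows actually used, and transpose back.
        arr = Application.Transpose(arr)
        ReDim Preserve arr(1 To 4, 1 To r)
        arr = Application.Transpose(arr)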

        With ThisWorkbook.Worksheets("Sheet1")
            ' Write the data first, then put the column headers in row 1.
            .Cells(1, 1).Resize(UBound(arr, 1), UBound(arr, 2)) = arr
            .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        End With

        .Quit
    End With
End Sub
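
The commented-out line in the code calls GetHTMLFileContent, which is not shown in this answer. A minimal sketch of what such a helper might look like is below. The function body is my assumption, not the code actually used during testing. It reads the saved file with Scripting.FileSystemObject and loads the markup into an in-memory HTMLDocument, so it needs the Microsoft HTML Object Library reference.

```vba
' Hypothetical helper: the name comes from the commented-out line above,
' but this body is only an illustrative sketch.
' Requires the reference "Microsoft HTML Object Library".
Public Function GetHTMLFileContent(ByVal filePath As String) As HTMLDocument
    Dim fso As Object, ts As Object
    Dim html As New HTMLDocument

    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile(filePath, 1)   ' 1 = ForReading

    ' Load the saved markup into an in-memory document so that
    ' querySelectorAll can be used on it without opening a browser.
    If Not ts.AtEndOfStream Then html.body.innerHTML = ts.ReadAll
    ts.Close

    Set GetHTMLFileContent = html
End Function
```

OpenTextFile reads the file as ANSI by default, so a page saved as UTF-8 may show garbled non-ASCII characters. If that matters, the file would need to be read with an explicit encoding instead.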

References (VBE > Tools > References):

  1. Microsoft HTML Object Library
  2. Microsoft Internet Controls
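
If you would rather not set these references, late binding is a possible alternative. The sketch below is an assumption on my part rather than part of the tested code: it creates Internet Explorer with CreateObject, declares everything As Object, and only prints how many result tables the selector finds. The sub name is made up for illustration.

```vba
' Late-bound sketch: works without the two library references,
' at the cost of IntelliSense and compile-time type checking.
Public Sub CountDistrictTablesLateBound()
    Dim ie As Object, districts As Object

    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .Visible = True
        .navigate "https://www.washingtonpost.com/election-results/florida/?noredirect=on&utm_term=.70d691c0a9e2"

        ' Wait until the page has finished loading.
        While .Busy Or .readyState < 4: DoEvents: Wend

        Set districts = .document.querySelectorAll(".tiling-results-wrapper #elections-code-root .wpe-result")
        Debug.Print "Result tables found: " & districts.Length

        .Quit
    End With
End Sub
```

To convert GetInfo the same way, typed declarations such as HTMLDocument and HTMLTable would have to become Object as well.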

Sample output:

[image: sample output]
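
The question also asks for the sheet to update automatically. One possible approach, sketched here under my own assumptions, is to re-run GetInfo on a timer with Application.OnTime. The sub names and the 5-minute interval are illustrative. The code has to live in a standard module, with the variable declaration at the top of that module.

```vba
' Time of the pending run; needed in order to cancel it later.
Private nextRun As Date

' Run the scraper now, then schedule this sub to run again.
Public Sub StartAutoRefresh()
    GetInfo
    nextRun = Now + TimeValue("00:05:00")   ' assumed 5-minute interval
    Application.OnTime EarliestTime:=nextRun, Procedure:="StartAutoRefresh"
End Sub

' Cancel the pending run, if there is one.
Public Sub StopAutoRefresh()
    On Error Resume Next   ' OnTime raises an error if nothing is scheduled
    Application.OnTime EarliestTime:=nextRun, Procedure:="StartAutoRefresh", Schedule:=False
    On Error GoTo 0
End Sub
```

Run StartAutoRefresh once to begin and StopAutoRefresh before closing the workbook. Otherwise Excel may reopen the file when the scheduled time arrives.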