使用Excel VBA检索网页中的所有Excel文件链接

时间:2018-11-23 08:45:33

标签: html excel vba excel-vba web-scraping

我正在尝试从网站上获取所有可下载的Excel文件链接,但遇到了困难。请帮助指导我。谢谢。

Sub TYEX()

    Dim internet As Object
    Dim internetdata As Object
    Dim div_result As Object
    Dim header_links As Object
    Dim link As Object
    Dim URL As String

    Set internet = CreateObject("InternetExplorer.Application")
    internet.Visible = True

    URL = "https://www.jpx.co.jp/markets/public/short-selling/index.html"
    internet.Navigate URL

    Do Until internet.ReadyState >= 4
        DoEvents
    Loop

    Application.Wait Now + TimeSerial(0, 0, 5)

    Set internetdata = internet.Document
    Set div_result = internetdata.getElementById("readArea")

    Set header_links = div_result.getElementsByTagName("td")

    For Each h In header_links
        Set link = h.ChildNodes.item(0)
        Cells(Range("A" & Rows.Count).End(xlUp).Row + 1, 1) = link.href
    Next

    MsgBox "done"
End Sub

2 个答案:

答案 0 :(得分:4)

您的想法正确无误,但这是另一种方法:

Sub TYEX()

    Dim ie As Object
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .navigate "https://www.jpx.co.jp/markets/public/short-selling/index.html"
        .Visible = True

        Do While .Busy Or .readyState < 4
            DoEvents
        Loop

        Dim doc As Object, tbl As Object
        Set doc = .document
        Set tbl = doc.getElementsByClassName("component-normal-table")(0).Children(0)

        Dim r As Long, xlsArr(), a As Object

        With tbl.Rows
            ReDim xlsArr(1 To .Length - 1)
            For r = 1 To .Length - 1   ' 0 is the table header
                xlsArr(r) = .Item(r).Children(1).innerHTML
            Next r
        End With

        With CreateObject("VBScript.RegExp")
            .Pattern = "<a href=""(\/markets.*?\.xls)"
            For r = 1 To UBound(xlsArr)
                xlsArr(r) = "https://www.jpx.co.jp" & .Execute(xlsArr(r))(0).SubMatches(0)
                Debug.Print xlsArr(r)
            Next
        End With
    End With

    'Add to sheet
    Dim ws As Worksheet, rng As Range
    Set ws = ThisWorkbook.Worksheets(1)
    With ws
        Set rng = .Range(.Cells(NextRow(ws), 1), .Cells(NextRow(ws) + UBound( _
                xlsArr) - 1, 1))
        rng.Value = Application.Transpose(xlsArr)
    End With

End Sub

Public Function NextRow(ByVal ws As Worksheet, Optional ByVal col As Variant = 1) As Long
    With ws
        NextRow = .Cells(.Rows.Count, col).End(xlUp).Row + 1
    End With
End Function

破坏代码

这将循环您的html表行。我们从1开始,因为0实际上只是表头。

With tbl.Rows
    ReDim xlsArr(1 To .Length - 1)
    For r = 1 To .Length - 1   ' 0 is the table header
        xlsArr(r) = .Item(r).Children(1).innerHTML
    Next r
End With

这使用正则表达式从innerHTML属性中提取网址。您可以在此处查看此特定正则表达式的工作方式:Regex101

With CreateObject("VBScript.RegExp")
    .Pattern = "<a href=""(\/markets.*?\.xls)"
    For r = 1 To UBound(xlsArr)
        xlsArr(r) = "https://www.jpx.co.jp" & .Execute(xlsArr(r))(0).SubMatches(0)
        Debug.Print xlsArr(r)
    Next
End With

您将范围的大小设置为与包含链接的数组相同的大小,然后将该数组写入工作表。通常,这比一一写入单元格要快得多。

'Add to sheet
Dim ws As Worksheet, rng As Range
Set ws = ThisWorkbook.Worksheets(1)
With ws
    Set rng = .Range(.Cells(NextRow(ws), 1), .Cells(NextRow(ws) + UBound( _
            xlsArr) - 1, 1))
    rng.Value = Application.Transpose(xlsArr)
End With

答案 1 :(得分:3)

您可以将attribute = value CSS selector$运算符一起使用,以表示href值必须以.xls结尾。然后使用querySelectorAll检索所有匹配的结果。使用CSS选择器是一种非常快速且通常可靠的方法。

Dim list As Object
Set list = ie.document.querySelectorAll("[href$='.xls']")

使用XMLHTTP比打开IE更快。请注意,然后您可以将这些链接传递给执行二进制下载的函数或URLMon进行下载。

Option Explicit   
Public Sub Links()
    Dim sResponse As String, html As HTMLDocument, list As Object, i As Long

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.jpx.co.jp/markets/public/short-selling/index.html", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With

    Set html = New HTMLDocument
    With html
        .body.innerHTML = sResponse
        Set list = html.querySelectorAll("[href$='.xls']")
    End With
    For i = 0 To list.Length - 1
        Debug.Print Replace$(list.item(i), "about:", "https://www.jpx.co.jp")
    Next
End Sub

示例下载功能(尽管您可以重复使用现有的XMLHTTP对象-这只是为了说明):

Public Function DownloadFile(ByVal downloadFolder As String, ByVal downloadURL As String) As String
    Dim http As Object , tempArr As Variant
    Set http =  CreateObject("WinHttp.WinHttpRequest.5.1")
    http.Open "GET", downloadURL, False
    http.send
    On Error GoTo errhand
    With CreateObject("ADODB.Stream")
        .Open
        .Type = 1
        .write http.responseBody
        tempArr = Split(downloadURL, "/")
        tempArr = tempArr(UBound(tempArr))
        .SaveToFile downloadFolder & tempArr, 2  '< "/" on enter of downloadFolder. 2 for overwrite which is Ok if no file modifications.
        .Close
    End With
    DownloadFile = downloadFolder & tempArr
    Exit Function
errhand:
    If Err.Number <> 0 Then
        Debug.Print Err.Number, Err.Description
        MsgBox "Download failed"
    End If
    DownloadFile = vbNullString
End Function

参考(VBE>工具>参考):

  1. Microsoft HTML对象库