浏览链接并下载PDF

时间:2019-11-10 22:13:32

标签: excel vba web-scraping web-crawler

我有一段代码在这里已经有一段时间了,涉及不同类型的问题。它越来越接近最终版本。但是,现在我有一个问题,就是代码中有错误,部分代码无法正常工作。

这个想法是浏览链接并获取PDF文件。链接已存储在sLinks中,请参见带有注释的行“检查链接是否存储在sLink中”。代码继续前进,文件被存储在C:\temp\中,但是在文件夹中有12个PDF之后,我得到一个错误,调试器指向xHttp.Open "GET", sLink

enter image description here

我浏览了PDF,看起来好像所有文件都被下载了……因为在几页上有一些相同,而且至少在两页上有一个Policy PDF。这就是为什么有17个链接和12个文件的原因。无论如何,为什么会引发错误?

enter image description here

可能是什么问题?

这是我的代码:

Sub DownloadFiles()
    Dim xHttp       As Object: Set xHttp = CreateObject("Microsoft.XMLHTTP")
    Dim hDoc        As MSHTML.HTMLDocument
    Dim Anchors     As Object
    Dim Anchor      As Variant
    Dim sPath       As String
    Dim wholeURL    As String

    Dim internet As InternetExplorer
    Dim internetdata As HTMLDocument
    Dim internetlink As Object
    Dim internetinnerlink As Object
    Dim arrLinks As Variant
    Dim sLink As String
    Dim iLinkCount As Integer
    Dim iCounter As Integer
    Dim sLinks As String

    Set internet = CreateObject("InternetExplorer.Application")
    internet.Visible = False
    internet.navigate ("https://www.nordicwater.com/products/waste-water/")

        Do While internet.Busy
          DoEvents
        Loop
        Do Until internet.readyState = READYSTATE_COMPLETE
            DoEvents
        Loop

        Set internetdata = internet.document
        Set internetlink = internetdata.getElementsByTagName("a")

        i = 1

        For Each internetinnerlink In internetlink
            If Left$(internetinnerlink, 36) = "https://www.nordicwater.com/product/" Then

                sLinks = sLinks & internetinnerlink.href & vbCrLf
                i = i + 1

            Else
            End If

    ThisWorkbook.Worksheets("Sheet1").range("B1").Value = sLinks ' Check that links are stored in sLinks

    Next internetinnerlink

    wholeURL = "https://www.nordicwater.com/"
    sPath = "C:\temp\"

    arrLinks = Split(sLinks, vbCrLf)
    iLinkCount = UBound(arrLinks) + 1

    For iCounter = 1 To iLinkCount
    sLink = arrLinks(iCounter - 1)
        'Get the directory listing
        xHttp.Open "GET", sLink ' DEBUGGER IS POINTING HERE
        xHttp.send

        'Wait for the page to load
        Do Until xHttp.readyState = 4
            DoEvents
        Loop

        'Put the page in an HTML document
        Set hDoc = New MSHTML.HTMLDocument
        hDoc.body.innerHTML = xHttp.responseText

        'Loop through the hyperlinks on the directory listing
        Set Anchors = hDoc.getElementsByTagName("a")

        For Each Anchor In Anchors

            'test the pathname to see if it matches your pattern
            If Anchor.pathname Like "*.pdf" Then

                xHttp.Open "GET", wholeURL & Anchor.pathname, False
                xHttp.send

                With CreateObject("Adodb.Stream")
                    .Type = 1
                    .Open
                    .write xHttp.responseBody
                    .SaveToFile sPath & getName(wholeURL & Anchor.pathname), 2 '//overwrite
                End With

            End If

        Next

    Next

End Sub

用于在链接外生成文件名的功能:

Function getName(pf As String) As String
    getName = Split(pf, "/")(UBound(Split(pf, "/")))
End Function

编辑:

我已经解决了第一期。 arrLinks = Split(p_sLinks, vbCrLf)更改为arrLinks = Split(sLinks, vbCrLf)。现在我面临另一个问题。

1 个答案:

答案 0 :(得分:1)

我将在调用HTTP GET之前添加If Len(sLink) > 0检查。

问题出在这一行:

sLinks = sLinks & internetinnerlink.href & vbCrLf

它将在sLinks列表的末尾添加一个额外的vbCrLf。应该是:

If sLinks <> "" Then sLinks = sLinks & vbCrLf
sLinks = sLinks & internetinnerlink.href

这样,最后一个链接之后就不会出现vbCrLf