从网页下载PDF文件

时间:2019-11-23 18:26:20

标签: vb.net xpath web-scraping web-crawler

我正在尝试从网站下载文件。我当前的解决方案似乎可以正常工作,但是有些事情我还是不明白。

第一个问题出现在以下时间:

//div[@class='large-4 medium-4 columns']//a

还有其他类为large-4 medium-4 columns的div。所以我得到了几个不必要的链接。如何摆脱它们?我只需要包含/products/

的页面

第二个问题是什么都没有下载到C:\temp\,我想其中有一些东西:

//div[@class='large-6 medium-8 columns large-centered']/a[string-length(@href)>0]

但是有什么问题吗?

“ xxx”是我的代码中的链接,应该是

Imports HtmlAgilityPack

Module Module1

    Sub Main()
        Dim mainUrl As String = "xxx"
        Dim htmlDoc As HtmlDocument = New HtmlWeb().Load(mainUrl) '< - - - Load the webage into htmldocument

        Dim listLinks As New List(Of String)

        Dim srcs As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='large-4 medium-4 columns']//a") '< - - - select nodes with links
        For Each src As HtmlNode In srcs

            ' Store links in array
            listLinks.Add(src.Attributes("href").Value)

            Console.WriteLine(src.Attributes("href").Value)

        Next

        Console.Read()

        For Each productLink As String In listLinks
            Dim prodDoc As HtmlDocument = New HtmlWeb().Load(productLink)

            Dim scrapedsrcs As HtmlNodeCollection = prodDoc.DocumentNode.SelectNodes("//div[@class='large-6 medium-8 columns large-centered']/a[string-length(@href)>0]") '< - - - select nodes with links

            If scrapedsrcs IsNot Nothing Then
                For Each scrapedlink As HtmlNode In scrapedsrcs
                    ' Show links in console
                    'Console.WriteLine($"-- {scrapedlink.Attributes("href").Value}") '< - - - Print urls  

                    My.Computer.Network.DownloadFile(scrapedlink.Attributes("href").Value, "C:\temp\" & System.IO.Path.GetFileName(scrapedlink.Attributes("href").Value) & ".pdf")
                Next
            End If
        Next

        Console.Read()

        ' End of scraping

    End Sub

End Module

编辑:

好吧,第一个应该是

//div[@class='row inset1 productItem padb1 padt1']/div[@class='large-4 medium-4 columns']//a

1 个答案:

答案 0 :(得分:1)

这会将小册子下载到运行应用程序的文件夹中:

    Dim htmlDoc As HtmlDocument = New HtmlWeb().Load("https://webpage.com")
    Dim ProductListPage As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a")
    For Each src As HtmlNode In ProductListPage
        htmlDoc = New HtmlWeb().Load(src.Attributes("href").Value)
        Dim LinkTester As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='row padt6 padb4']//a")
        If LinkTester IsNot Nothing Then
            For Each dllink In LinkTester
                Dim LinkURL As String = dllink.Attributes("href").Value
                Console.WriteLine(LinkURL)

                Dim ExtractFilename As String = LinkURL.Substring(LinkURL.LastIndexOf("/"))
                Dim DLClient As New WebClient
                DLClient.DownloadFileAsync(New Uri(LinkURL), ".\" & ExtractFilename)
            Next
        End If
    Next