Get all links containing "/product/"

Time: 2019-11-23 12:09:29

Tags: vb.net web-scraping web-crawler html-agility-pack

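I want to get all links that contain /product/. There are 17 links on the page that contain /product/. How can I do this?

This line seems to be the problem: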

Dim srcs = From iframeNode In htmlDoc.DocumentNode.SelectNodes("//a[@href]")
                       Select iframeNode.Attributes("href").Value

How can I add a condition to filter by /product/?

This is what I have so far:

Imports HtmlAgilityPack

Module Module1

    Sub Main()
        Dim mainUrl As String = "https://www.nordicwater.com/products/waste-water/"
        Dim htmlDoc As New HtmlAgilityPack.HtmlDocument

        htmlDoc.LoadHtml(mainUrl)

        Dim srcs = From iframeNode In htmlDoc.DocumentNode.SelectNodes("//a[@href]")
                   Select iframeNode.Attributes("href").Value

        'print all the src you got
        For Each src In srcs
            Console.WriteLine(src)
        Next
    End Sub

End Module

Edit:

Working solution:

    Imports HtmlAgilityPack

    Module Module1

        Sub Main()
            Dim mainUrl As String = "https://www.nordicwater.com/products/waste-water/"
            Dim htmlDoc As HtmlDocument = New HtmlWeb().Load(mainUrl) ' Load the web page into an HtmlDocument

            Dim srcs As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//ul[@class='products-list-page']//a") ' Select the link nodes in the product list
            For Each src As HtmlNode In srcs
                Console.WriteLine(src.Attributes("href").Value) ' Print each URL
            Next

            Console.Read() ' Keep the console window open
        End Sub

    End Module

1 answer:

Answer 0 (score: 1)

You have to load the web page first, then select the nodes you want and the attribute to print.

Here is one way to do it:

    Dim mainUrl As String = "https://www.nordicwater.com/products/waste-water/"
    Dim htmlDoc As HtmlDocument = New HtmlWeb().Load(mainUrl) ' Load the web page into an HtmlDocument

    Dim srcs As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//ul[@class='products-list-page']//a") ' Select the link nodes in the product list
    For Each src As HtmlNode In srcs
        Console.WriteLine(src.Attributes("href").Value) ' Print each URL
    Next

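You need to learn how to debug. If you had stepped through your code, you would have seen that LoadHtml(mainUrl) sets the HTML of htmlDoc to the URL string itself instead of loading the actual HTML of the web page.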
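
The snippet above selects links by their parent list rather than by the /product/ substring the question asks about. As a rough sketch of the substring approach, the XPath contains() function can filter on the href value directly. The inline HTML and the module name ProductLinkFilter below are made up for illustration so that the example runs without network access; the real page's markup may differ.

```vbnet
Imports System
Imports HtmlAgilityPack

Module ProductLinkFilter

    Sub Main()
        ' Hypothetical sample markup; the href values are placeholders, not real URLs.
        Dim html As String =
            "<ul>" &
            "<li><a href='/product/item-a/'>Item A</a></li>" &
            "<li><a href='/products/waste-water/'>Category page</a></li>" &
            "<li><a href='/product/item-b/'>Item B</a></li>" &
            "</ul>"

        Dim htmlDoc As New HtmlDocument()
        htmlDoc.LoadHtml(html) ' LoadHtml parses an HTML string; it does not download a URL

        ' XPath contains() keeps only anchors whose href includes "/product/".
        Dim nodes As HtmlNodeCollection =
            htmlDoc.DocumentNode.SelectNodes("//a[contains(@href, '/product/')]")

        ' SelectNodes returns Nothing when no node matches, so guard before iterating.
        If nodes Is Nothing Then
            Console.WriteLine("No matching links found.")
            Return
        End If

        For Each node As HtmlNode In nodes
            Console.WriteLine(node.GetAttributeValue("href", String.Empty))
        Next
    End Sub

End Module
```

With this sample input, only the two /product/ links should be printed; /products/waste-water/ is skipped because "s" follows "product" there instead of a slash. To run it against the live page, replace the LoadHtml call with New HtmlWeb().Load(mainUrl) as shown above. If you would rather keep the LINQ query from the question, adding a clause such as Where iframeNode.Attributes("href").Value.Contains("/product/") before the Select should give the same filtering.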