Getting text content using Html Agility Pack

Asked: 2011-07-26 01:06:13

Tags: html vb.net html-agility-pack

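I will try to be specific. I am working on a crawler in VB.NET, and I am mainly interested in extracting the text content of a page. My current application uses a WebBrowser control to load the body of the HTML source into a text box, as follows: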

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    Dim url As String = "<url>"
    WebBrowser1.Navigate(url)
End Sub

Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
    TextBox2.Text = WebBrowser1.Document.Body.OuterHtml
End Sub

At this point TextBox2 holds junk HTML containing href, img, ads, script and so on, but what I need is to get rid of all that markup and keep only the plain text.

I could use regular expressions to strip all of that out, but I think HAP is better suited to the job since it is an HTML parser.

Searching here led me to this page, which discusses the whitelist technique mentioned by 'Meltdown':

HTML Agility Pack strip tags NOT IN whitelist

It seems like a good idea, but how do I apply it in VB.NET?

Please advise.

Edit: I found a VB.NET version of the code, shown below, but it seems to have an error on this line:

If i IsNot DeletableNodesXpath.Count - 1 Then
  

Error: 'IsNot' requires operands that have reference types, but this operand has the value type 'Integer'.

Here is the code:

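Imports System.Collections.Generic
Imports System.Linq
Imports HtmlAgilityPack

Public NotInheritable Class HtmlSanitizer
Private Sub New()
End Sub

' Allowed tag names mapped to their allowed attributes (Nothing = no attributes allowed).
Private Shared ReadOnly Whitelist As IDictionary(Of String, String())
' Names of the nodes that StripHtml will remove.
Private Shared DeletableNodesXpath As New List(Of String)()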

Shared Sub New()
    Whitelist = New Dictionary(Of String, String())() From { _
        {"a", New String() {"href"}}, _
        {"strong", Nothing}, _
        {"em", Nothing}, _
        {"blockquote", Nothing}, _
        {"b", Nothing}, _
        {"p", Nothing}, _
        {"ul", Nothing}, _
        {"ol", Nothing}, _
        {"li", Nothing}, _
        {"div", New String() {"align"}}, _
        {"strike", Nothing}, _
        {"u", Nothing}, _
        {"sub", Nothing}, _
        {"sup", Nothing}, _
        {"table", Nothing}, _
        {"tr", Nothing}, _
        {"td", Nothing}, _
        {"th", Nothing} _
    }
End Sub

Public Shared Function Sanitize(input As String) As String
    If input.Trim().Length < 1 Then
        Return String.Empty
    End If
    Dim htmlDocument = New HtmlDocument()

    htmldocument.LoadHtml(input)
    SanitizeNode(htmldocument.DocumentNode)
    Dim xPath As String = HtmlSanitizer.CreateXPath()

    Return StripHtml(htmldocument.DocumentNode.WriteTo().Trim(), xPath)
End Function

Private Shared Sub SanitizeChildren(parentNode As HtmlNode)
    For i As Integer = parentNode.ChildNodes.Count - 1 To 0 Step -1
        SanitizeNode(parentNode.ChildNodes(i))
    Next
End Sub

Private Shared Sub SanitizeNode(node As HtmlNode)
    If node.NodeType = HtmlNodeType.Element Then
        If Not Whitelist.ContainsKey(node.Name) Then
            If Not DeletableNodesXpath.Contains(node.Name) Then
                'DeletableNodesXpath.Add(node.Name.Replace("?",""));
                node.Name = "removeableNode"
                DeletableNodesXpath.Add(node.Name)
            End If
            If node.HasChildNodes Then
                SanitizeChildren(node)
            End If

            Return
        End If

        If node.HasAttributes Then
            For i As Integer = node.Attributes.Count - 1 To 0 Step -1
                Dim currentAttribute As HtmlAttribute = node.Attributes(i)
                Dim allowedAttributes As String() = Whitelist(node.Name)
                If allowedAttributes IsNot Nothing Then
                    If Not allowedAttributes.Contains(currentAttribute.Name) Then
                        node.Attributes.Remove(currentAttribute)
                    End If
                Else
                    node.Attributes.Remove(currentAttribute)
                End If
            Next
        End If
    End If

    If node.HasChildNodes Then
        SanitizeChildren(node)
    End If
End Sub

Private Shared Function StripHtml(html As String, xPath As String) As String
    Dim htmlDoc As New HtmlDocument()
    htmlDoc.LoadHtml(html)
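    If xPath.Length > 0 Then
        Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)
        ' SelectNodes returns Nothing when no node matches the expression.
        If invalidNodes IsNot Nothing Then
            For Each node As HtmlNode In invalidNodes
                node.ParentNode.RemoveChild(node, True)
            Next
        End If
    End If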
    Return htmlDoc.DocumentNode.WriteContentTo()


End Function

Private Shared Function CreateXPath() As String
    Dim _xPath As String = String.Empty
    For i As Integer = 0 To DeletableNodesXpath.Count - 1
        If i IsNot DeletableNodesXpath.Count - 1 Then
            _xPath += String.Format("//{0}|", DeletableNodesXpath(i).ToString())
        Else
            _xPath += String.Format("//{0}", DeletableNodesXpath(i).ToString())
        End If
    Next
    Return _xPath
End Function
End Class

Could someone please help?

1 answer:

Answer 0: (score: 0)

Don't use IsNot; just use <>. You are basically checking that the value of one integer is not equal to the value of another integer minus 1.

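I don't think IsNot can be used with integers: it compares object references, and Integer is a value type, which is what the error message says.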
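As a minimal sketch of that fix, here is the CreateXPath loop from the question with <> in place of IsNot. The module name XPathDemo and the sample tag names are made up for illustration, and the list is passed in as a parameter so the snippet can run on its own:

```vbnet
Imports System
Imports System.Collections.Generic

Module XPathDemo
    ' Joins tag names into an XPath union such as "//script|//img".
    Function CreateXPath(names As List(Of String)) As String
        Dim xPath As String = String.Empty
        For i As Integer = 0 To names.Count - 1
            ' <> compares Integer values; IsNot only compares object references.
            If i <> names.Count - 1 Then
                xPath &= String.Format("//{0}|", names(i))
            Else
                xPath &= String.Format("//{0}", names(i))
            End If
        Next
        Return xPath
    End Function

    Sub Main()
        ' Hypothetical sample input.
        Dim names As New List(Of String) From {"script", "img"}
        Console.WriteLine(CreateXPath(names))   ' prints //script|//img
    End Sub
End Module
```

If LINQ is available, String.Join("|", names.Select(Function(n) "//" & n)) should build the same string without the index check.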

Edit: I just noticed that this is super, super old. I only just saw the July 26 date!
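For the original goal of getting only the plain text, a simpler route than the whitelist sanitizer may be enough. The sketch below assumes the HtmlAgilityPack assembly is referenced; ExtractText and PlainTextDemo are hypothetical names, not part of HAP. It drops script and style nodes and then reads InnerText from what is left:

```vbnet
Imports System
Imports HtmlAgilityPack

Module PlainTextDemo
    ' Hypothetical helper: returns the text content of an HTML fragment.
    Function ExtractText(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when no node matches, so guard before looping.
        Dim junk As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//script|//style")
        If junk IsNot Nothing Then
            For Each node As HtmlNode In junk
                node.Remove()
            Next
        End If

        ' InnerText concatenates the remaining text; DeEntitize decodes entities such as &amp;.
        Return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim()
    End Function

    Sub Main()
        ' Made-up sample markup.
        Dim html As String = "<div><script>var x = 1;</script><p>Hello <b>world</b></p></div>"
        Console.WriteLine(ExtractText(html))
    End Sub
End Module
```

In the question's form, the same helper could presumably be called from the DocumentCompleted handler with WebBrowser1.Document.Body.OuterHtml as the input. Whitespace and any leftover comment text may still need tidying, depending on the page.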