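I'll try to be specific. I'm basically working on a crawler in VB.NET, and I'm mostly interested in extracting the text content of a page. My current application uses the WebBrowser control to load the body of the HTML source into a text box, like this: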
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
Dim url As String = "<url>"
WebBrowser1.Navigate(url)
End Sub
Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
TextBox2.Text = WebBrowser1.Document.Body.OuterHtml
End Sub
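At this point TextBox2 is filled with junk HTML containing href, img, ads, script and so on, but I need to get rid of all that markup and keep only the plain text.
I could apply regular expressions to strip all of these out, but I think HAP (HTML Agility Pack) is a better fit because it is an actual HTML parser.
Searching here led me to this page, which discusses the whitelist technique mentioned by 'Meltdown':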
HTML Agility Pack strip tags NOT IN whitelist
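But how do I apply it in VB.NET, since it seems like a good idea?
Please advise, guys.
Edit: I found a VB.NET version of the code, shown below, but it seems to have an error: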
If i IsNot DeletableNodesXpath.Count - 1 Then
Error: 'IsNot' requires operands that have reference types, but this operand has the value type 'Integer'.
Here is the code:
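Imports System.Collections.Generic
Imports System.Linq
Imports HtmlAgilityPack
Public NotInheritable Class HtmlSanitizer
' Private constructor: the class only exposes shared members.
Private Sub New()
End Sub
' Allowed element names mapped to their allowed attributes (Nothing = no attributes allowed).
Private Shared ReadOnly Whitelist As IDictionary(Of String, String())
Private Shared DeletableNodesXpath As New List(Of String)()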
Shared Sub New()
Whitelist = New Dictionary(Of String, String())() From { _
{"a", New String() {"href"}}, _
{"strong", Nothing}, _
{"em", Nothing}, _
{"blockquote", Nothing}, _
{"b", Nothing}, _
{"p", Nothing}, _
{"ul", Nothing}, _
{"ol", Nothing}, _
{"li", Nothing}, _
{"div", New String() {"align"}}, _
{"strike", Nothing}, _
{"u", Nothing}, _
{"sub", Nothing}, _
{"sup", Nothing}, _
{"table", Nothing}, _
{"tr", Nothing}, _
{"td", Nothing}, _
{"th", Nothing} _
}
End Sub
Public Shared Function Sanitize(input As String) As String
If String.IsNullOrWhiteSpace(input) Then
Return String.Empty
End If
' Use a local name that does not collide with the HtmlDocument type.
Dim doc As New HtmlDocument()
doc.LoadHtml(input)
SanitizeNode(doc.DocumentNode)
Dim xPath As String = HtmlSanitizer.CreateXPath()
Return StripHtml(doc.DocumentNode.WriteTo().Trim(), xPath)
End Function
Private Shared Sub SanitizeChildren(parentNode As HtmlNode)
For i As Integer = parentNode.ChildNodes.Count - 1 To 0 Step -1
SanitizeNode(parentNode.ChildNodes(i))
Next
End Sub
Private Shared Sub SanitizeNode(node As HtmlNode)
If node.NodeType = HtmlNodeType.Element Then
If Not Whitelist.ContainsKey(node.Name) Then
If Not DeletableNodesXpath.Contains(node.Name) Then
'DeletableNodesXpath.Add(node.Name.Replace("?",""));
node.Name = "removeableNode"
DeletableNodesXpath.Add(node.Name)
End If
If node.HasChildNodes Then
SanitizeChildren(node)
End If
Return
End If
If node.HasAttributes Then
For i As Integer = node.Attributes.Count - 1 To 0 Step -1
Dim currentAttribute As HtmlAttribute = node.Attributes(i)
Dim allowedAttributes As String() = Whitelist(node.Name)
If allowedAttributes IsNot Nothing Then
If Not allowedAttributes.Contains(currentAttribute.Name) Then
node.Attributes.Remove(currentAttribute)
End If
Else
node.Attributes.Remove(currentAttribute)
End If
Next
End If
End If
If node.HasChildNodes Then
SanitizeChildren(node)
End If
End Sub
Private Shared Function StripHtml(html As String, xPath As String) As String
Dim htmlDoc As New HtmlDocument()
htmlDoc.LoadHtml(html)
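If xPath.Length > 0 Then
Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)
' SelectNodes returns Nothing when no node matches the XPath.
If invalidNodes IsNot Nothing Then
For Each node As HtmlNode In invalidNodes
' Remove the placeholder element but keep its children.
node.ParentNode.RemoveChild(node, True)
Next
End If
End If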
Return htmlDoc.DocumentNode.WriteContentTo()
End Function
Private Shared Function CreateXPath() As String
Dim _xPath As String = String.Empty
For i As Integer = 0 To DeletableNodesXpath.Count - 1
If i IsNot DeletableNodesXpath.Count - 1 Then
_xPath += String.Format("//{0}|", DeletableNodesXpath(i).ToString())
Else
_xPath += String.Format("//{0}", DeletableNodesXpath(i).ToString())
End If
Next
Return _xPath
End Function
End Class
Can someone please help?
Answer 0 (score: 0)
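Don't use IsNot; just use <> instead, since you are essentially checking that the value of one integer is not equal to the value of another integer minus 1.
I don't think IsNot can be used with integers: it compares object references, and Integer is a value type.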
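As a minimal sketch of that fix, the loop from CreateXPath could look like the following. The module name XPathDemo, the names parameter and the sample list are hypothetical and only there to make the snippet self-contained; the String.Join variant is an alternative I am suggesting, not part of the original code.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module XPathDemo
    ' Builds "//a|//b|//c" from a list of element names, as CreateXPath does.
    Function CreateXPath(names As List(Of String)) As String
        Dim xPath As String = String.Empty
        For i As Integer = 0 To names.Count - 1
            ' <> compares Integer values; IsNot only compares object references.
            If i <> names.Count - 1 Then
                xPath &= String.Format("//{0}|", names(i))
            Else
                xPath &= String.Format("//{0}", names(i))
            End If
        Next
        Return xPath
    End Function

    ' Alternative: String.Join puts the separator only between items,
    ' so no last-element check is needed.
    Function CreateXPathJoined(names As List(Of String)) As String
        Return String.Join("|", names.Select(Function(n) "//" & n))
    End Function

    Sub Main()
        Dim names As New List(Of String) From {"script", "style", "img"}
        Console.WriteLine(CreateXPath(names))        ' //script|//style|//img
        Console.WriteLine(CreateXPathJoined(names))  ' //script|//style|//img
    End Sub
End Module
```

Both functions return the same string; the second one just avoids the index comparison entirely.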
Edit: I just noticed this is super, super old. I just saw the July 26 date!
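If the goal is plain text rather than sanitized HTML, a shorter route may be enough. This is only a sketch, assuming the HtmlAgilityPack package is referenced; the module name PlainTextDemo, the function ToPlainText and the sample markup are hypothetical. It removes script and style elements and then reads InnerText, which is a different technique from the whitelist approach in the question.

```vbnet
Imports System
Imports HtmlAgilityPack

Module PlainTextDemo
    ' Returns the text of an HTML fragment, without script/style content.
    Function ToPlainText(html As String) As String
        If String.IsNullOrWhiteSpace(html) Then Return String.Empty

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when nothing matches, so guard the loop.
        Dim unwanted As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//script|//style")
        If unwanted IsNot Nothing Then
            For Each node As HtmlNode In unwanted
                node.Remove()
            Next
        End If

        ' InnerText concatenates the text nodes; DeEntitize decodes entities such as &amp;.
        Return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim()
    End Function

    Sub Main()
        Console.WriteLine(ToPlainText("<div><script>var x = 1;</script><p>Hello &amp; welcome</p></div>"))
    End Sub
End Module
```

In the WebBrowser1_DocumentCompleted handler from the question, this could be called as TextBox2.Text = ToPlainText(WebBrowser1.Document.Body.OuterHtml). Whitespace between elements is kept as-is, so some extra cleanup may still be needed.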