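I am parsing a web page to return all unique sentences on the page that contain at least two words. It almost works. The following is displayed as a single sentence on the page, but my code splits off the text inside the <b></b> tags. How can I remove inline styling/tags so that I get the sentence as it is displayed in the browser, including the text inside bold tags or any other inline element (such as strong tags)?
Currently it returns "NHL Playoffs" as one line of text, followed by "Takeaways: Sharks beat Penguins for first Stanley Cup Final win" as a second sentence, when there should be only one sentence.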
<span class="titletext"><b>NHL Playoffs</b> Takeaways: Sharks beat Penguins for first Stanley Cup Final win</span>
Here is my ASP.NET VB.NET code (a C# solution is fine too).
    Public Shared Function validateIsMoreThanOneWord(input As String, numberWords As Integer) As Boolean
        If String.IsNullOrEmpty(input) Then
            Return False
        End If
        Return (input.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries).Length >= numberWords)
    End Function
    Private Sub form1_Load(sender As Object, e As EventArgs) Handles form1.Load
        Try
            Dim html = New HtmlDocument()
            html.LoadHtml(New WebClient().DownloadString("http://news.google.ca/nwshp?hl=en&ei=4H1UV7-NNOfCjwTAl4bABw&ved=0EKkuCAkoBw"))
            Dim root = html.DocumentNode
            Dim myList As New List(Of String)()
            For Each node As HtmlNode In root.Descendants().Where(Function(n) n.NodeType = HtmlNodeType.Text AndAlso n.ParentNode.Name <> "script" AndAlso n.ParentNode.Name <> "style" AndAlso n.ParentNode.Name <> "css")
                If Not node.HasChildNodes Then
                    Dim text As String = HttpUtility.HtmlDecode(node.InnerText)
                    If Not String.IsNullOrEmpty(text) And Not String.IsNullOrWhiteSpace(text) Then
                        If validateIsMoreThanOneWord(text.Trim(), 2) Then
                            myList.Add(text.Trim())
                        End If
                    End If
                End If
            Next
            'remove dups from array and other stuff
            Dim q As String() = myList.Distinct().ToArray()
            For i As Integer = 0 To UBound(q)
                Response.Write(q(i).Trim() & "<br/>")
            Next
            Response.Write(q.Count)
        Catch ex As Exception
            Response.Write(ex.Message)
        End Try
    End Sub
I hope you can shed some light on a solution. Thanks!
Answer 0 (score: 0)
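Because you loop over every descendant text node of the root whose parent is not <script>, <style>, or css, each child node inside .titletext really is treated as a separate text node. The text inside <b> and the text after it are therefore returned as two entries.
What you want is to retrieve the InnerText of each .titletext entry.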
Here is what I did in C#; it should give you an idea of what you need to do.
    HtmlWeb w = new HtmlWeb();
    var htmlDoc = w.Load("http://news.google.ca/nwshp?hl=en&ei=4H1UV7-NNOfCjwTAl4bABw&ved=0EKkuCAkoBw");
    // SelectNodes returns null when no node matches the XPath expression.
    var textTitles = htmlDoc.DocumentNode.SelectNodes("//span[@class='titletext']");
    // for testing purposes
    if (textTitles != null)
        foreach (var textTitle in textTitles)
            Console.WriteLine(textTitle.InnerText);
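To connect this back to the original requirement (unique entries with at least two words), here is a minimal sketch that runs against the sample span from the question instead of the live page. The class name `TitleTextDemo` and the helper `ExtractTitles` are made up for illustration, and it assumes the Html Agility Pack NuGet package is referenced. Because `InnerText` concatenates the text of all descendants, the `<b>` text and the text that follows it should come back as one string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TitleTextDemo
{
    // Hypothetical helper: returns the distinct, entity-decoded texts of all
    // span.titletext elements that contain at least minWords words.
    public static List<string> ExtractTitles(string html, int minWords)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var spans = doc.DocumentNode.SelectNodes("//span[@class='titletext']");
        if (spans == null)
            return new List<string>();

        return spans
            // InnerText joins the text of every descendant, inline tags included.
            .Select(s => HtmlEntity.DeEntitize(s.InnerText).Trim())
            .Where(t => t.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length >= minWords)
            .Distinct()
            .ToList();
    }

    public static void Main()
    {
        // Sample markup copied from the question.
        const string html =
            "<span class=\"titletext\"><b>NHL Playoffs</b> Takeaways: " +
            "Sharks beat Penguins for first Stanley Cup Final win</span>";

        foreach (var title in ExtractTitles(html, 2))
            Console.WriteLine(title);
    }
}
```

This only covers elements with the `titletext` class. For arbitrary pages you would need to decide which container elements count as one "sentence" (for example `p`, `li`, or `span`) and take `InnerText` at that level rather than at the level of individual text nodes.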