Remove inline styles from innerHtml using HtmlAgilityPack

Date: 2016-06-05 20:34:10

Tags: c# asp.net vb.net html-agility-pack

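I'm parsing a web page to return all of the unique sentences on the page, each with at least two words. It almost works. The following is displayed as a single sentence on the page, but my code is splitting off the text inside the <b></b> tags. How can I strip inline styles/tags so that I get the sentence as it appears in the browser, including the text inside bold tags or any other inline markup (such as strong tags)?

Currently it returns "NHL Playoffs" as one line of text, followed by "Takeaways: Sharks beat Penguins for first Stanley Cup Final win" as a second sentence, when it should be just one sentence.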

<span class="titletext"><b>NHL Playoffs</b> Takeaways: Sharks beat Penguins for first Stanley Cup Final win</span>

Here is my asp.net vb.net code (a c# solution is fine).

    Public Shared Function validateIsMoreThanOneWord(input As String, numberWords As Integer) As Boolean
        If String.IsNullOrEmpty(input) Then
            Return False
        End If
        Return (input.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries).Length >= numberWords)
    End Function

    Private Sub form1_Load(sender As Object, e As EventArgs) Handles form1.Load

        Try

            Dim html = New HtmlDocument()
            html.LoadHtml(New WebClient().DownloadString("http://news.google.ca/nwshp?hl=en&ei=4H1UV7-NNOfCjwTAl4bABw&ved=0EKkuCAkoBw"))

            Dim root = html.DocumentNode

            Dim myList As New List(Of String)()

            For Each node As HtmlNode In root.Descendants().Where(Function(n) n.NodeType = HtmlNodeType.Text AndAlso n.ParentNode.Name <> "script" AndAlso n.ParentNode.Name <> "style" AndAlso n.ParentNode.Name <> "css")

                If Not node.HasChildNodes Then
                    Dim text As String = HttpUtility.HtmlDecode(node.InnerText)

                    If Not String.IsNullOrEmpty(text) And Not String.IsNullOrWhiteSpace(text) Then
                        If validateIsMoreThanOneWord(text.Trim(), 2) Then
                            myList.Add(text.Trim())
                        End If
                    End If
                End If
            Next

            'remove dups from array and other stuff
            Dim q As String() = myList.Distinct().ToArray()

            For i As Integer = 0 To UBound(q)
                Response.Write(q(i).Trim() & "<br/>")
            Next

            Response.Write(q.Count)


        Catch ex As Exception
            Response.Write(ex.Message)
        End Try
    End Sub

Hope you can shed some light on a solution. Thanks!

1 Answer:

Answer 0 (score: 0)

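Since you are looping through all of the root's descendant text nodes whose parent is not <script>, <style>, or css, you treat each child node inside .titletext as a separate text node. The text in <b> is one text node and the text after it is another, so they come back as two entries.

What you want is to retrieve the InnerText of each .titletext entry.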
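To make the difference concrete, here is a minimal sketch that runs both approaches against the snippet from the question. The class name `Demo` and the helpers `TextNodeTexts` and `SpanText` are made up for illustration; only the HtmlAgilityPack calls (`LoadHtml`, `Descendants`, `SelectSingleNode`, `InnerText`) come from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class Demo
{
    // The snippet from the question.
    public const string Sample =
        "<span class=\"titletext\"><b>NHL Playoffs</b> Takeaways: Sharks beat Penguins for first Stanley Cup Final win</span>";

    // Per-text-node iteration, as in the question: the <b> content and the
    // trailing text are separate text nodes, so they come back separately.
    public static List<string> TextNodeTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => n.InnerText)
            .ToList();
    }

    // InnerText on the enclosing span concatenates the text of all of its
    // descendants, so the <b> part stays attached to the rest of the sentence.
    public static string SpanText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var span = doc.DocumentNode.SelectSingleNode("//span[@class='titletext']");
        return span == null ? null : span.InnerText;
    }

    public static void Main()
    {
        foreach (var part in TextNodeTexts(Sample))
            Console.WriteLine("[" + part + "]");
        Console.WriteLine(SpanText(Sample));
    }
}
```
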

Here is what I did in C#, so you can get an idea of what you need to do.

    // Requires: using System; using HtmlAgilityPack;
    HtmlWeb w = new HtmlWeb();
    var htmlDoc = w.Load("http://news.google.ca/nwshp?hl=en&ei=4H1UV7-NNOfCjwTAl4bABw&ved=0EKkuCAkoBw");

    // SelectNodes returns null (not an empty collection) when nothing matches.
    var textTitles = htmlDoc.DocumentNode.SelectNodes("//span[@class='titletext']");

    // for testing purposes
    if (textTitles != null)
        foreach (var textTitle in textTitles)
            Console.WriteLine(textTitle.InnerText);
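
The XPath above only covers span.titletext. If you need the same behavior for arbitrary pages, one possible heuristic (an assumption on my part, not something HtmlAgilityPack provides out of the box) is to walk up from each text node past purely inline formatting tags and take the InnerText of the first ancestor that is not one of them. The names `SentenceExtractor`, `InlineTags`, `ContainerOf`, and `Extract` are hypothetical, and the list of inline tags is a guess you would need to adjust.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SentenceExtractor
{
    // Assumption: tags treated as inline formatting only. Extend as needed.
    static readonly HashSet<string> InlineTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "b", "strong", "i", "em", "u", "font", "small", "mark" };

    // Text inside these elements is never shown to the reader.
    static readonly HashSet<string> SkippedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "script", "style" };

    // Walks up from a text node past inline formatting tags to the element
    // that holds the whole phrase.
    static HtmlNode ContainerOf(HtmlNode textNode)
    {
        var node = textNode.ParentNode;
        while (node.ParentNode != null && InlineTags.Contains(node.Name))
            node = node.ParentNode;
        return node;
    }

    public static List<string> Extract(string html, int minWords)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text
                        && !SkippedTags.Contains(n.ParentNode.Name))
            .Select(n => ContainerOf(n))
            .Distinct()                                   // one entry per container
            .Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim())
            .Where(t => t.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                         .Length >= minWords)
            .Distinct()                                   // drop duplicate sentences
            .ToList();
    }
}
```

Calling `SentenceExtractor.Extract(html, 2)` on the span from the question should give a single entry containing the full headline. Keep in mind that InnerText includes every descendant of the container, so an element that mixes its own text with nested block elements will have their text merged into one string; for pages like that you would need a stricter rule for choosing the container.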