How do I extract the visible text of a page from its HTML source?

Asked: 2012-02-05 22:58:06

Tags: c# html

I tried HtmlAgilityPack with the following code, but it does not pick up the text inside HTML lists:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlStr);
HtmlNode node = doc.DocumentNode;
return node.InnerText;

Here is the HTML it fails on:

<as html>
<p>This line is picked up <b>correctly</b>.  List items hasn't...</p>
<p><ul>
<li>List Item 1</li>
<li>List Item 2</li>
<li>List Item 3</li> 
<li>List Item 4</li>
</ul></p>
</as html>

2 answers:

Answer 0: (score: 3)

Because you need to traverse the tree yourself and concatenate the InnerText of all the nodes.

Answer 1: (score: 3)

The following code works for me:

string StripHTML(string htmlStr)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlStr);
    var root = doc.DocumentNode;
    string s = "";
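    // Visit every node in the document, including the root.
    foreach (var node in root.DescendantNodesAndSelf())
    {
        // Only leaf nodes carry text of their own; taking InnerText from
        // parents as well would duplicate the text of their children.
        if (!node.HasChildNodes)
        {
            string text = node.InnerText;
            if (!string.IsNullOrEmpty(text))
            {
                s += text.Trim() + " ";
            }
        }
    }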
    return s.Trim();
}
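
Since the question asks for visible text specifically, here is a rough variant of the same leaf-walking idea that also tries to leave out things a browser would not display. It is only a sketch under a few assumptions: the class and method names (VisibleText, Extract) are made up for illustration, it assumes the contents of script and style elements show up as text nodes under those elements, and it treats only script, style and comments as invisible (it knows nothing about CSS such as display:none). It keeps text nodes only, decodes entities with HtmlEntity.DeEntitize, and uses a StringBuilder instead of repeated string concatenation.

```csharp
using System.Text;
using HtmlAgilityPack;

static class VisibleText   // hypothetical helper, not part of HtmlAgilityPack
{
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sb = new StringBuilder();
        foreach (HtmlNode node in doc.DocumentNode.DescendantNodesAndSelf())
        {
            // Keep text nodes only; this drops elements and comments.
            if (node.NodeType != HtmlNodeType.Text)
                continue;

            // Assumption: script/style bodies are text nodes whose parent
            // is the script/style element, so skip them by parent name.
            string parent = node.ParentNode != null ? node.ParentNode.Name : "";
            if (parent == "script" || parent == "style")
                continue;

            // Decode entities such as &amp; and discard whitespace-only text.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length == 0)
                continue;

            // Separate fragments with a single space.
            if (sb.Length > 0)
                sb.Append(' ');
            sb.Append(text);
        }
        return sb.ToString();
    }
}
```

You would call it the same way as StripHTML above, e.g. string visible = VisibleText.Extract(htmlStr);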