Question

I need to run some logic on every text node of an HTMLDocument. This is how I currently do it:

HTMLDocument pageContent = (HTMLDocument)_webBrowser2.Document;
IHTMLElementCollection myCol = pageContent.all;
foreach (IHTMLDOMNode myElement in myCol)
{
    foreach (IHTMLDOMNode child in (IHTMLDOMChildrenCollection)myElement.childNodes)
    {
        if (child.nodeType == 3)
        {
            // Do something with the text node!
        }
    }
}

Because some elements in myCol also have child nodes that are themselves in myCol, I visit some nodes more than once! There must be a better way to do this?

Answer 1

It is probably best to iterate over childNodes (the direct descendants) in a recursive function, starting at the top level, like this:

// pageContent is the mshtml HTMLDocument from the question;
// documentElement is its root <html> element.
IHTMLDOMNode htmlNode = (IHTMLDOMNode)pageContent.documentElement;
ProcessChildNodes(htmlNode);

private void ProcessChildNodes(IHTMLDOMNode node)
{
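    // childNodes is exposed as object, so cast it to an enumerable collection
    foreach (IHTMLDOMNode childNode in (IHTMLDOMChildrenCollection)node.childNodes)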
    {
        if (childNode.nodeType == 3)
        {
            // ...
        }
        ProcessChildNodes(childNode);
    }
}
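
If deep recursion is a concern on heavily nested pages, the same walk can be written with an explicit stack. The following is only a sketch under a few assumptions: the project references the mshtml interop assembly, and the names TextNodeWalker and GetTextNodes are made up for illustration. It relies on the firstChild, lastChild, nextSibling and previousSibling members of IHTMLDOMNode instead of casting childNodes.

```csharp
using System.Collections.Generic;
using mshtml;

// Hypothetical helper, not part of mshtml.
public static class TextNodeWalker
{
    private const int TextNodeType = 3; // DOM nodeType value for text nodes

    // Depth-first walk with an explicit stack; each text node under
    // root is yielded once, in document order.
    public static IEnumerable<IHTMLDOMNode> GetTextNodes(IHTMLDOMNode root)
    {
        var pending = new Stack<IHTMLDOMNode>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            IHTMLDOMNode node = pending.Pop();

            if (node.nodeType == TextNodeType)
            {
                yield return node;
                continue; // text nodes have no children
            }

            // Push children last-to-first so the first child is popped first.
            for (IHTMLDOMNode child = node.lastChild;
                 child != null;
                 child = child.previousSibling)
            {
                pending.Push(child);
            }
        }
    }
}
```

It could then be used along these lines, where pageContent is the HTMLDocument from the question:

```csharp
IHTMLDOMNode root = (IHTMLDOMNode)pageContent.documentElement;
foreach (IHTMLDOMNode textNode in TextNodeWalker.GetTextNodes(root))
{
    string text = textNode.nodeValue as string; // nodeValue is typed as object
    // ...
}
```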

Answer 2

You can use XPath from the HTML Agility Pack to reach all text nodes in one go.

I think it would look like the following, but I have not tried it.

using HtmlAgilityPack;
HtmlDocument htmlDoc = new HtmlDocument();

// filePath is a path to a file containing the html
htmlDoc.Load(filePath);
HtmlNodeCollection coll = htmlDoc.DocumentNode.SelectNodes("//text()");

// SelectNodes returns null when the XPath matches nothing
if (coll != null)
{
    foreach (HtmlNode node in coll)
    {
        // do the work for a text node here
    }
}
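
A slightly fuller sketch of the same idea, assuming the HtmlAgilityPack NuGet package is installed and the markup is already available as a string rather than a file. The class and method names (TextNodeExtractor, GetTextNodes) are invented for illustration. It skips whitespace-only nodes and text found inside script and style elements, which the "//text()" query would otherwise return as well.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class TextNodeExtractor
{
    // Returns the entity-decoded, trimmed text of every non-blank text node.
    public static List<string> GetTextNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//text()");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            // Skip text that belongs to script or style elements.
            string parentName = node.ParentNode != null ? node.ParentNode.Name : "";
            if (parentName == "script" || parentName == "style")
            {
                continue;
            }

            string text = HtmlEntity.DeEntitize(node.InnerText);
            if (!string.IsNullOrWhiteSpace(text))
            {
                result.Add(text.Trim());
            }
        }

        return result;
    }
}
```

To apply this to the live document from the question, one option would be to pass pageContent.documentElement.outerHTML as the html argument. Note that the Agility Pack then works on its own parsed copy, so changes made to its nodes do not affect the page shown in the browser control.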

How can I retrieve all text nodes of an HTMLDocument in the fastest way in C#?
