Question

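I need to get all the nodes from an HTML document, then get the text and child nodes of each of those nodes, and then the same again for the children's children. For example, I have this HTML: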

<p>This <b>is a <a href="">Link</a></b> with <b>bold</b></p>

So I need a way to get the p node, then the unformatted text (This), the bold-only text (is a), the bold link (Link), and the rest of the text, formatted or not.

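I know that with HtmlDocument I can select all nodes and child nodes, but how can I get the text that comes before a child node, then the child node itself along with its own text and children, so that I can build a rendered version of the HTML ("This is a Link with bold")?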

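Note that the example above is a simple one. The real HTML will contain more complex things such as lists, frames, numbered lists, triple-formatted text, and so on. Also note that the rendering itself is not the problem: I have already done that, just in a different way. What I need is only the part that gets the nodes and their content. Also, I can't ignore any node, so I can't do any filtering. The top-level node could be a p, div, frame, ul, etc.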

Answer 1

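After looking at HtmlDocument and its properties, and thanks to @HungCao's observation, I found a simple way to interpret the HTML code.

My code is a bit more complex, so I'll post a trimmed-down version of it.

First, the HtmlDocument has to be loaded. This can be done in any function: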

HtmlDocument htmlDoc = new HtmlDocument();
string html = @"<p>This <b>is a <a href="""">Link</a></b> with <b>bold</b></p>";
htmlDoc.LoadHtml(html);

Then we need to interpret each "main" node (p in this example) and, depending on its type, call a looping function (InterNode):

HtmlNodeCollection nodes = htmlDoc.DocumentNode.ChildNodes;

foreach (HtmlNode node in nodes)
{
    if (node.Name.ToLower() == "p") // Lower-case the name, just in case
    {
        Paragraph newPPara = new Paragraph();
        foreach(HtmlNode childNode in node.ChildNodes)
        {
            InterNode(childNode, ref newPPara);
        }
        richTextBlock.Blocks.Add(newPPara);
    }
}

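Note that there is a property called "NodeType", but it does not return the tag type: it only reports the kind of node (an HtmlNodeType value such as Document, Element, Comment, or Text), so a p and a b are both reported as Element. Use the "Name" property instead (also note that the Name property of HtmlNode is not the same thing as the name attribute in HTML).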
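
To see the difference between the two properties, a small console sketch can print both for every node in the example. It assumes the HtmlAgilityPack NuGet package is referenced; the class name NodeKinds is made up for illustration and is not part of the original code.

```csharp
using System;
using HtmlAgilityPack;

class NodeKinds // hypothetical name, for illustration only
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<p>This <b>is a <a href="""">Link</a></b> with <b>bold</b></p>");

        // Descendants() enumerates every node below the document node.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // NodeType is the kind of node; Name is the tag name,
            // or "#text" when the node is a text node.
            Console.WriteLine($"{node.NodeType,-8} {node.Name}");
        }
    }
}
```

Tags should show up as Element with their tag name in Name, while pieces of text show up as Text with the name "#text", which is why the code here branches on Name.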

Finally, we have the InterNode function, which adds inlines to the referenced (ref) paragraph:

public bool InterNode(HtmlNode htmlNode, ref Paragraph originalPar)
{
    string htmlNodeName = htmlNode.Name.ToLower();

    // Collect the names of all ancestors. We need every one of them,
    // because the text can be b(old) and i(talic) at the same time.
    List<string> nodeAttList = new List<string>();
    HtmlNode parentNode = htmlNode.ParentNode;
    while (parentNode != null)
    {
        nodeAttList.Add(parentNode.Name.ToLower());
        parentNode = parentNode.ParentNode;
    }

    Run newRun = new Run();
    foreach (string nodeAttStr in nodeAttList) // apply every inherited attribute to the inline
    {
        switch (nodeAttStr)
        {
            case "b":
            case "strong":
                newRun.FontWeight = FontWeights.Bold;
                break;
            case "i":
            case "em":
                newRun.FontStyle = FontStyle.Italic;
                break;
        }
    }

    if (htmlNodeName == "#text") // "#text" means it is a text node, like <i><#text/></i>. Thanks @HungCao
    {
        newRun.Text = htmlNode.InnerText;
        originalPar.Inlines.Add(newRun); // attach the run to the paragraph so it gets rendered
    }
    else // not a #text node: don't read its InnerText, because any text it has lives in a #text child node
    {
        foreach (HtmlNode childNode in htmlNode.ChildNodes)
        {
            InterNode(childNode, ref originalPar);
        }
    }

    return true;
}
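
The same recursive walk can be tried outside a UI project. The following is a minimal console sketch, not the full code mentioned in this answer: instead of building Run objects, it records each text node together with the names of the elements that enclose it. It assumes the HtmlAgilityPack NuGet package is referenced; TextRunWalker and Walk are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class TextRunWalker // hypothetical name, for illustration only
{
    // Depth-first walk: a text node becomes one entry, any other node is descended into.
    static void Walk(HtmlNode node, List<(string Text, string[] Tags)> runs)
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            // Ancestors() yields the parent first, then each node above it;
            // keeping only elements drops the document node itself.
            string[] tags = node.Ancestors()
                .Where(a => a.NodeType == HtmlNodeType.Element)
                .Select(a => a.Name)
                .ToArray();
            runs.Add((HtmlEntity.DeEntitize(node.InnerText), tags));
            return;
        }

        foreach (HtmlNode child in node.ChildNodes)
        {
            Walk(child, runs);
        }
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<p>This <b>is a <a href="""">Link</a></b> with <b>bold</b></p>");

        var runs = new List<(string Text, string[] Tags)>();
        Walk(doc.DocumentNode, runs);

        foreach (var (text, tags) in runs)
        {
            Console.WriteLine($"\"{text}\" [{string.Join(" < ", tags)}]");
        }
    }
}
```

Each entry pairs a piece of text with the element names around it, innermost first. That is the information needed to decide on bold, italic, or link styling at render time, and no node is filtered out along the way.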

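Note: I know I said that my app needs to render the HTML in a different way from what a WebView does, and I know that this sample code produces the same result as a WebView, but as I said before, this is just a trimmed-down version of my final code. In fact, my original/full code does what I need; this is only the base.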
