Question

Given an example like this:

<p>there is something here <span>we can't have this</span> again here <em>but we keep this one</em> we are good to go now </p>

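I have a way to remove the span node, so that I get only the inner text of all the other tags. But I need to keep the span tag in the document and just skip its InnerText when I read the text. Right now I have this: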

var paragraphe = html.DocumentNode.SelectNodes("p");
for (int i = 0; i < paragraphe.Count; i++)
{
    string innerTextOfP = paragraphe[i].InnerText;
    if (string.IsNullOrEmpty(innerTextOfP))
    {
        //Do something later.
    }
    else
    {
        //something is done here with the text I get.
    }
}

The best approach I can think of is to make a second selection:

var nodeSpan = html.DocumentNode.SelectNodes("span");

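and compare against it while I iterate over the child nodes of the P, collecting the text in a string buffer and skipping the content whenever paragraphe.childNode = nodeSpan. But I think Agility Pack has another way to do this kind of thing; I just don't know what it is.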

In my case, I also need to skip the content of a DIV (and its children) if its class is anything other than "contenu".

So the approach I plan to use for the span does not work well for the DIV part.

How should I do this with Agility Pack?

Edit: the expected result for this case is:

string innerTextOfP = "there is something here again here but we keep this one we are good to go now"

Answer 1

You can remove the span children from a copy of the paragraph:

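// requires: using System.Linq; using System.Text.RegularExpressions; using HtmlAgilityPack;
var paragraphes = html.DocumentNode.SelectNodes("//p");

// SelectNodes returns null (not an empty collection) when nothing matches,
// so each result is guarded with an empty sequence.
foreach (var p in paragraphes ?? Enumerable.Empty<HtmlNode>())
{
    var clone = p.Clone(); // to avoid modification of original html
    foreach (var span in clone.SelectNodes("span") ?? Enumerable.Empty<HtmlNode>())
        clone.RemoveChild(span);

    // only divs whose class is not "contenu" are removed
    foreach (var div in clone.SelectNodes("div[not(@class='contenu')]") ?? Enumerable.Empty<HtmlNode>())
        clone.RemoveChild(div);

    // remove other nodes which you want to skip here

    // collapse whitespace runs and drop leading/trailing whitespace
    string innerTextOfP = Regex.Replace(clone.InnerText, @"\s+", " ").Trim();
}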

Note that I use a regular expression to replace runs of consecutive whitespace with a single space. The output is:

there is something here again here but we keep this one we are good to go now
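
The child-node walk described in the question can also be done directly, without cloning or removing anything. The following is only a minimal sketch of that idea, not part of the original answer: `TextExtractor`, `ShouldSkip`, `Collect` and `GetText` are hypothetical names introduced here, and the skip rules (every span, plus every div whose class attribute is not exactly "contenu") are an assumption based on the question.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static class TextExtractor
{
    // Hypothetical rule set: skip every <span>, and every <div>
    // whose class attribute is not exactly "contenu".
    static bool ShouldSkip(HtmlNode node)
    {
        if (node.NodeType != HtmlNodeType.Element) return false;
        if (node.Name == "span") return true;
        return node.Name == "div"
            && node.GetAttributeValue("class", "") != "contenu";
    }

    // Depth-first walk: append text nodes, recurse into kept elements,
    // ignore skipped elements (with their subtree) and comments.
    static void Collect(HtmlNode node, StringBuilder buffer)
    {
        foreach (var child in node.ChildNodes)
        {
            if (ShouldSkip(child)) continue;

            if (child.NodeType == HtmlNodeType.Text)
                buffer.Append(child.InnerText);
            else if (child.NodeType == HtmlNodeType.Element)
                Collect(child, buffer);
        }
    }

    public static string GetText(HtmlNode node)
    {
        var buffer = new StringBuilder();
        Collect(node, buffer);
        // Same whitespace normalisation as in the answer, plus trimming.
        return Regex.Replace(buffer.ToString(), @"\s+", " ").Trim();
    }
}

static class Program
{
    static void Main()
    {
        var html = new HtmlDocument();
        html.LoadHtml("<p>there is something here <span>we can't have this</span> again here <em>but we keep this one</em> we are good to go now </p>");

        var p = html.DocumentNode.SelectSingleNode("//p");
        Console.WriteLine(TextExtractor.GetText(p));
    }
}
```

Because the walk only reads the tree, the span stays in the document, and nested spans or divs are skipped at any depth, whereas the XPath expressions "span" and "div[...]" above only match direct children of the paragraph.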

AgilityPack: select InnerText but skip specific tags
