Question

I'm using HtmlAgilityPack & XPath.

How can I identify inconsistencies in HTML? For example:

<table><tr><td>
<b>Car1</b><span>Color123</span>
<bCar2</b><span>Color333</span>
<b>Car3</b><span>Color221</span>
<b>Car4 <span>Color224</span>
<b>Car5</b><span>Color621</span>
</table></tr></td>

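The bold tags for Car2 & Car4 are broken.

The problem is that I use root.SelectNodes("//b[1]")[index], and it misses index position 2 (Car2) and puts Car3 in its place. I would not even know this had happened unless I checked the results manually. At a minimum, I need an "empty" position 2 (Car2) and Car3 correctly at position 3.

Html Agility Pack can't identify and fix this automatically. doc.ParseErrors does not detect it.

Can you suggest some combination of XPath functions such as substring, boolean, concat, number, etc.? I'm not very good at XPath, but I feel these functions could help identify the inconsistency.

P.S. The Html Tidy library can't fix it either. It sometimes decides on: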

<b>Car4 <span>Color224</span></b>

which is incorrect.

Answer 1

HtmlDocument.ParseErrors does contain 3 errors for your example:

 - Start tag <b> was not found (because there is a closing b without an opening one)
 - Start tag <tr> was not found (because the tr is inside an opening b without a closing one)
 - Start tag <td> was not found (same as tr)

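In the general case, it is impossible 1) to identify errors the way you want, and 2) even harder to fix them. You must define precisely what format you expect.

You can use Html Agility Pack to identify errors against specific requirements. For example, here is a piece of code that validates your document against the functional requirement "every child element of a TD must be a B or a SPAN and must not contain more than one grandchild":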

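    using System;
    using HtmlAgilityPack;

    HtmlDocument doc = new HtmlDocument();
    doc.Load("MyFile.htm");

    // SelectNodes returns null when nothing matches, so guard before iterating.
    HtmlNodeCollection childrenOfTd = doc.DocumentNode.SelectNodes("//td/*");
    if (childrenOfTd != null)
    {
        foreach (HtmlNode childOfTd in childrenOfTd)
        {
            // Flag any child that is neither <b> nor <span>, or that has more than one child node.
            bool wrongElement = (childOfTd.Name != "b") && (childOfTd.Name != "span");
            if (wrongElement || (childOfTd.ChildNodes.Count > 1))
            {
                Console.WriteLine("child error, outerHtml=" + childOfTd.OuterHtml);
            }
        }
    }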

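To fix the errors, you need access to the raw text (possibly with a regex; BTW, a regex can also identify simple errors), because the DOM that Html Agility Pack builds does not, by design, give you access to the syntactically incorrect nodes.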
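
As a rough illustration of the raw-text approach, here is a minimal sketch that uses only System.Text.RegularExpressions. It assumes each record sits on its own line in the form `<b>Name</b><span>Color</span>`, as in the question's sample, and it treats any line containing a `<span>` as a record candidate. The `RecordScanner` class and its `Scan` method are hypothetical names made up for this example; they are not part of Html Agility Pack. Malformed records are returned as null so that later records keep their original positions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordScanner
{
    // Assumed record shape: <b>Name</b><span>Color</span>, one record per line.
    private static readonly Regex WellFormed = new Regex(
        @"^\s*<b>(?<name>[^<]+)</b><span>(?<color>[^<]+)</span>\s*$");

    // Assumption: any line that opens a <span> is meant to be a record.
    private static readonly Regex Candidate = new Regex(@"<span\b");

    // Returns one entry per candidate line; null marks a malformed record.
    public static List<string> Scan(string rawHtml)
    {
        var result = new List<string>();
        foreach (string line in rawHtml.Split('\n'))
        {
            if (!Candidate.IsMatch(line))
            {
                continue; // skips <table>, </table> and blank lines
            }
            Match m = WellFormed.Match(line);
            result.Add(m.Success ? m.Groups["name"].Value.Trim() : null);
        }
        return result;
    }

    public static void Main()
    {
        string html = string.Join("\n", new[]
        {
            "<table><tr><td>",
            "<b>Car1</b><span>Color123</span>",
            "<bCar2</b><span>Color333</span>",
            "<b>Car3</b><span>Color221</span>",
            "<b>Car4 <span>Color224</span>",
            "<b>Car5</b><span>Color621</span>",
            "</table></tr></td>"
        });

        List<string> cars = Scan(html);
        for (int i = 0; i < cars.Count; i++)
        {
            Console.WriteLine((i + 1) + ": " + (cars[i] ?? "(malformed)"));
        }
    }
}
```

For the sample in the question, positions 2 and 4 come back as "(malformed)" while Car3 stays at position 3 and Car5 at position 5. If the real documents do not keep one record per line, the candidate rule would need to be redefined, which is the same "define the expected format precisely" point made above.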
