Question

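I am currently working on a scraper written in C# 4.0. I use a variety of tools, including .NET's built-in WebClient and Regex classes. For part of the scraper I parse HTML documents with HtmlAgilityPack. I got everything working the way I wanted and then did some code cleanup.

I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I ran some tests and the method seemed to work well. But once I used it in my code, I kept getting a KeyNotFoundException. There are no further details, so I am pretty lost. My code looks like this: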

WebClient client = new WebClient();
string html = HtmlEntity.DeEntitize(client.DownloadString(path));
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

The downloaded HTML is UTF-8 encoded. How can I get around the KeyNotFoundException?

Answer 1

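As far as I can tell, the problem is caused by non-standard characters, for example Chinese or Japanese characters.

Once you have found which characters cause the problem, you may be able to find a suitable patch for HtmlAgilityPack here

If you want to modify the HtmlAgilityPack source yourself, this may help you.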
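
To find out which entities cause the problem, something like the following might help. It is only a sketch: EntityDiagnostics and FindFailingEntities are hypothetical names that are not part of HtmlAgilityPack, and the decoder is passed in as a delegate so the helper itself only depends on the .NET base class library. It only looks at well-formed references such as &amp; or &#233;, so malformed text containing a stray & would need a separate check.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class EntityDiagnostics
{
    // Matches well-formed named or numeric references such as &amp; or &#233;
    private static readonly Regex EntityPattern = new Regex(@"&#?[A-Za-z0-9]+;");

    // Returns the distinct entities in 'html' for which 'decode'
    // throws a KeyNotFoundException, in order of first appearance.
    public static List<string> FindFailingEntities(string html, Func<string, string> decode)
    {
        var seen = new HashSet<string>();
        var failing = new List<string>();
        foreach (Match match in EntityPattern.Matches(html))
        {
            // Test each distinct entity only once.
            if (!seen.Add(match.Value))
            {
                continue;
            }
            try
            {
                decode(match.Value);
            }
            catch (KeyNotFoundException)
            {
                failing.Add(match.Value);
            }
        }
        return failing;
    }
}
```

With HtmlAgilityPack referenced, it could be called as EntityDiagnostics.FindFailingEntities(html, HtmlEntity.DeEntitize) and the returned list logged.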

Answer 2

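Four years later, I have the same problem with some encoded characters (version 1.4.9.5). In my case there is a limited set of characters that can cause the problem, so I just created a function to perform the replacements: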

// to be called before HtmlEntity.DeEntitize
public static string ReplaceProblematicHtmlEntities(string str)
{
    var sb = new StringBuilder(str);
    //TODO: add other replacements, as needed
    return sb.Replace("&period;", ".")
        .Replace("&abreve;", "ă")
        .Replace("&acirc;", "â")
        .ToString();
}

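In my case the string contains both HTML-encoded characters and UTF-8 characters, but the problem only involves certain encoded characters.

This is not an elegant solution, but it is a quick fix for text where the problematic encoded characters are all known.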
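
If the problematic entities are not known in advance, another option might be to catch the exception and fall back to a different decoder. The sketch below is an assumption on my part, not something tested against HtmlAgilityPack 1.4.9.5: SafeEntityDecoder is a hypothetical name, the primary decoder is passed in as a delegate, and the fallback is System.Net.WebUtility.HtmlDecode from the base class library. As far as I know, that fallback leaves references it does not recognize unchanged instead of throwing, so some entities may stay encoded.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical wrapper, not part of HtmlAgilityPack.
public static class SafeEntityDecoder
{
    // Tries 'primary' first; if it throws a KeyNotFoundException,
    // decodes the same text with WebUtility.HtmlDecode instead.
    public static string Decode(string text, Func<string, string> primary)
    {
        try
        {
            return primary(text);
        }
        catch (KeyNotFoundException)
        {
            return WebUtility.HtmlDecode(text);
        }
    }
}
```

With HtmlAgilityPack referenced, the call would look like SafeEntityDecoder.Decode(text, HtmlEntity.DeEntitize).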

Answer 3

My HTML had a block of text like this:

... found in sections: 233.9 & 517.3; ...

Despite the space and the decimal point, it interpreted & 517.3; as a Unicode character.

Simply HTML-encoding the raw text solved the problem for me.

string raw = "sections: 233.9 & 517.3;";
// turn '&' into '&amp;', etc, before DeEntitizing
string encoded = System.Web.HttpUtility.HtmlEncode(raw);
string deEntitized = HtmlEntity.DeEntitize(encoded);

Answer 4

In my case, I fixed this by updating HtmlAgilityPack to version 1.5.0.

KeyNotFoundException when using the HtmlEntity.DeEntitize() method