Question

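I recently started experimenting with HtmlAgilityPack. I'm not familiar with all of its options, and I think that is why I'm doing something wrong.

I have a string with the following content: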

string s = "<span style=\"color: #0000FF;\"><</span>";

As you can see, inside the span I have a 'less than' sign. I process this string with the following code:

HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(s);

But when I take a quick and dirty look inside the span like this:

htmlDocument.DocumentNode.ChildNodes[0].InnerHtml

I see that the span is empty.

Which option do I need to set to keep the 'less than' sign? I have already tried this:

htmlDocument.OptionAutoCloseOnEnd = false;
htmlDocument.OptionCheckSyntax = false;
htmlDocument.OptionFixNestedTags = false;

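But with no success.

I know it is invalid HTML. I am using this to fix invalid HTML and to apply HTMLEncode to the 'less than' signs.

Please point me in the right direction. Thanks in advance.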

Answer 1

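Html Agility Pack detects this as an error and creates an HtmlParseError instance for it. You can read all errors through the ParseErrors property of the HtmlDocument class. So, if you run this code: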

    string s = "<span style=\"color: #0000FF;\"><</span>";
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(s);
    doc.Save(Console.Out);

    Console.WriteLine();
    Console.WriteLine();

    foreach (HtmlParseError err in doc.ParseErrors)
    {
        Console.WriteLine("Error");
        Console.WriteLine(" code=" + err.Code);
        Console.WriteLine(" reason=" + err.Reason);
        Console.WriteLine(" text=" + err.SourceText);
        Console.WriteLine(" line=" + err.Line);
        Console.WriteLine(" pos=" + err.StreamPosition);
        Console.WriteLine(" col=" + err.LinePosition);
    }

it will display the following (first the corrected text, then the details of the error):

<span style="color: #0000FF;"></span>

Error
 code=EndTagNotRequired
 reason=End tag </> is not required
 text=<
 line=1
 pos=30
 col=31

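So you can try to fix this error yourself, because you have all the required information (including line, column, and stream position), but the general process of fixing (as opposed to detecting) errors in HTML is very complex.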
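As a rough sketch of what "fixing it yourself" could look like for this one error kind: the helper below escapes the '<' found at each reported position. EscapeAt is a hypothetical name, not part of Html Agility Pack, and the sketch assumes that StreamPosition is a zero-based index pointing at the stray character, which is what the sample output above (pos=30, text=<) suggests but which I have not verified for other inputs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper (not an Html Agility Pack API): replaces the '<' at each
// given zero-based position with "&lt;". Positions that are out of range or do
// not point at a '<' are ignored.
static string EscapeAt(string html, IEnumerable<int> positions)
{
    var sb = new StringBuilder(html);

    // Work from the end so that earlier indexes stay valid as the text grows.
    foreach (int pos in positions.Distinct().OrderByDescending(p => p))
    {
        if (pos >= 0 && pos < sb.Length && sb[pos] == '<')
            sb.Remove(pos, 1).Insert(pos, "&lt;");
    }

    return sb.ToString();
}

// Position 30 is the value reported as "pos" in the sample output above.
Console.WriteLine(EscapeAt("<span style=\"color: #0000FF;\"><</span>", new[] { 30 }));
// prints: <span style="color: #0000FF;">&lt;</span>
```

With the real library, the positions would presumably come from something like doc.ParseErrors filtered to the EndTagNotRequired code and projected to StreamPosition, after which the fixed string can be passed to LoadHtml again. Treat that wiring as an assumption; other error codes would need their own handling.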

Answer 2

As another answer noted, the best solution I found was to pre-parse the HTML and convert any orphaned < signs to their HTML-encoded value, &lt;.

return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
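To make the behaviour of that pattern easier to see, here is a small sketch that wraps it in a method and runs it on a few inputs. The wrapper name EscapeOrphanedLessThan is made up for illustration; the pattern itself is the one shown above. The negative lookahead treats a '<' as orphaned when it is not followed by at least one non-'<' character and then a '>'.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern from this answer.
static string EscapeOrphanedLessThan(string html)
{
    // '<' is replaced unless some run of non-'<' characters followed by '>'
    // comes right after it, i.e. unless it looks like the start of a tag.
    return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
}

// The string from the question: only the stray '<' is encoded.
Console.WriteLine(EscapeOrphanedLessThan("<span style=\"color: #0000FF;\"><</span>"));
// prints: <span style="color: #0000FF;">&lt;</span>

// A lone '<' in plain text is encoded as well.
Console.WriteLine(EscapeOrphanedLessThan("1 < 2"));
// prints: 1 &lt; 2

// Limitation: a '<' that is later followed by any '>' is left alone.
Console.WriteLine(EscapeOrphanedLessThan("a < b > c"));
// prints: a < b > c
```

The last case shows that the regex is a heuristic: it cannot tell a real tag from text that merely happens to contain a later '>'.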

Answer 3

Fix the markup, because your HTML string is invalid:

string s = "<span style=\"color: #0000FF;\">&lt;</span>";

Answer 4

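Although the given HTML is indeed invalid, HtmlAgilityPack should still be able to parse it. Forgetting to encode "<" is not an uncommon mistake on the web, and if HtmlAgilityPack is used as a crawler, it should anticipate bad HTML. I tested the example in IE, Chrome, and Firefox, and they all display the extra < as text.

I wrote the following method, which you can use to preprocess the HTML string and replace every "unclosed" '<' character with "&lt;":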

// Requires: using System.Collections.Generic; using System.Text;
static string PreProcess(string htmlInput)
{
    // Index of the last unclosed '<' character, or -1 if the last '<' character is closed.
    int lastLt = -1;

    // This list will be populated with the positions of all unclosed '<' characters.
    List<int> ltPositions = new List<int>();

    // Collect the unclosed '<' characters.
    for (int i = 0; i < htmlInput.Length; i++)
    {
        if (htmlInput[i] == '<')
        {
            // A new '<' before any '>' means the previous '<' was never closed.
            if (lastLt != -1)
                ltPositions.Add(lastLt);

            lastLt = i;
        }
        else if (htmlInput[i] == '>')
            lastLt = -1;
    }

    // A '<' still pending at the end of the input is unclosed as well.
    if (lastLt != -1)
        ltPositions.Add(lastLt);

    // If no unclosed '<' characters are found, then just return the input string.
    if (ltPositions.Count == 0)
        return htmlInput;

    // Build the output string, replacing each unclosed '<' character with "&lt;".
    StringBuilder htmlOutput = new StringBuilder(htmlInput.Length + 3 * ltPositions.Count);
    int start = 0;

    foreach (int ltPosition in ltPositions)
    {
        htmlOutput.Append(htmlInput, start, ltPosition - start);
        htmlOutput.Append("&lt;");
        start = ltPosition + 1;
    }

    htmlOutput.Append(htmlInput, start, htmlInput.Length - start);
    return htmlOutput.ToString();
}

Answer 5

The string "s" is bad HTML.

string s = "<span style=\"color: #0000FF;\">&lt;</span>";

This is the correct version.

Losing the 'less than' sign in HtmlAgilityPack LoadHtml
