HtmlAgilityPack treats everything after < (less-than sign) as attributes

Time: 2016-06-15 11:19:15

Tags: c# html-agility-pack

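I get some input from a textarea and convert that input into an HTML document, which is later parsed into a PDF document.

When my users enter a less-than sign (<), everything in my HtmlDocument breaks. HtmlAgilityPack suddenly treats everything after the less-than sign as attributes. See the output: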

  

Within this Character Data block I can use double dashes as much as I want (along with <, &="" ',="" and="" ')="" *and="" *="" %="" myparamentity;="" will="" be="" expanded="" to="" the="" text="" 'has="" been="" expanded'...however,="" i="" can't="" use="" the="" cend="" sequence(if="" i="" need="" to="" use="" it="" i="" must="" escape="" one="" of="" the="" brackets="" or="" the="" greater-than="" sign).="">

It gets a little better if I add:

htmlDocument.OptionOutputOptimizeAttributeValues = true;

which gives me:

  

Within this Character Data block I can use double dashes as much as I want (along with <, &= ',= and= ')= *and= *= %= myparamentity;= will= be= expanded= to= the= text= 'has= been= expanded'...however,= i= can't= use= the= cend= sequence(if= i= need= to= use= it= i= must= escape= one= of= the= brackets= or= the= greater-than= sign).=>

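I have tried all the options on HtmlDocument, but none of them lets me specify that the parser should not be strict. On the other hand, I could live with it stripping the <, but adding all the equals signs doesn't work for me.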

void Main()
{
    var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";

    var htmlDoc = WrapContentInHtml(input);

    htmlDoc.DocumentNode.OuterHtml.ToString().Dump();
}

private HtmlDocument WrapContentInHtml(string content)
{
    var htmlBuilder = new StringBuilder();
    htmlBuilder.AppendLine("<!DOCTYPE html>");
    htmlBuilder.AppendLine("<html>");
    htmlBuilder.AppendLine("<head>");
    htmlBuilder.AppendLine("<title></title>");
    htmlBuilder.AppendLine("</head>");
    htmlBuilder.AppendLine("<body><div id='sagsfremstillingContainer'>");
    htmlBuilder.AppendLine(content); 
    htmlBuilder.AppendLine("</div></body></html>");

    var htmlDocument = new HtmlDocument();
    htmlDocument.OptionOutputOptimizeAttributeValues = true;
    var htmlDoc = htmlBuilder.ToString();

    htmlDocument.LoadHtml(htmlDoc);

    return htmlDocument;
}

Does anyone know how to fix this?

The closest question I could find is: Losing the 'less than' sign in HtmlAgilityPack loadhtml

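That poster is actually complaining about the < disappearing, which would be fine for me. Of course, fixing the parse error would be the best solution.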

Edit: I am using HtmlAgilityPack 1.4.9

2 answers:

Answer 0: (score: 3)

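Your content is plainly wrong. This isn't about being "strict"; it's about the fact that you are pretending a piece of text is valid HTML. In fact, you get these results precisely because the parser is not strict: it tries to make sense of the invalid markup instead of rejecting it.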

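When you need to insert plain text into HTML, you have to encode it first so that all the HTML control characters are properly converted. For example, < must be changed to &lt; and & to &amp;.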

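One way to handle this is to use the DOM: set InnerText on the target div instead of splicing strings together and pretending they are HTML. Another way is to use an explicit encoding method, such as HttpUtility.HtmlEncode.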
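
As a minimal sketch of the explicit-encoding route, the question's WrapContentInHtml could encode the user text before splicing it into the markup. This version returns a string rather than an HtmlDocument so that it depends only on the base class library. The class name HtmlWrapper is hypothetical, and System.Net.WebUtility.HtmlEncode is swapped in for HttpUtility.HtmlEncode so that no System.Web reference is assumed.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper class, not part of the original question.
public static class HtmlWrapper
{
    // Builds the page markup as a string. The user-supplied text is
    // HTML-encoded first, so characters such as < and & become entities
    // and cannot be mistaken for markup by a parser.
    public static string WrapContentInHtml(string content)
    {
        var htmlBuilder = new StringBuilder();
        htmlBuilder.AppendLine("<!DOCTYPE html>");
        htmlBuilder.AppendLine("<html>");
        htmlBuilder.AppendLine("<head>");
        htmlBuilder.AppendLine("<title></title>");
        htmlBuilder.AppendLine("</head>");
        htmlBuilder.AppendLine("<body><div id='sagsfremstillingContainer'>");
        // The only change from the question's code: encode before appending.
        htmlBuilder.AppendLine(WebUtility.HtmlEncode(content));
        htmlBuilder.AppendLine("</div></body></html>");
        return htmlBuilder.ToString();
    }

    public static void Main()
    {
        // The body line of the printed markup reads: a &lt; b &amp; c
        Console.WriteLine(WrapContentInHtml("a < b & c"));
    }
}
```

The returned string can then be passed to htmlDocument.LoadHtml as in the question. Because the text no longer contains a raw <, the parser has no stray tag to turn into attributes.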

Answer 1: (score: 1)

You can use System.Net.WebUtility.HtmlEncode, which works even without a reference to System.Web.dll (the assembly that provides HttpServerUtility.HtmlEncode):

var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(System.Net.WebUtility.HtmlEncode(input));
Debug.Assert(!htmlDocument.ParseErrors.Any());

Result:

Within this Character Data block I can use double dashes as much as I want (along with &lt;, &amp;, &#39;, and &#39;) *and * % MyParamEntity; will be expanded to the text &#39;Has been expanded&#39;...however, I can&#39;t use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).