Ignoring whitespace in HtmlNode.InnerText

Time: 2016-07-25 13:06:15

Tags: c# html-agility-pack

I have this HTML snippet:

<p>Rendered on a website, 
this will all be on one line.</p>
<p>This would be on another line.</p>

and this C# code:

HtmlDocument doc = new HtmlDocument();
doc.Load(path);

string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);

Now "text" spans three lines:

Rendered on a website, 
this will all be on one line.
This would be on another line.

But I want:

Rendered on a website, this will all be on one line.
This would be on another line.

Is this possible with HtmlAgilityPack?

1 answer:

Answer 0 (score: 0)

You can do something like this:

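using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

string html = @"<p>Rendered on a website,
                this will all be on one line.</p>
                <p>This would be on another line.</p>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
// Collapses runs of whitespace, such as the indentation inside the string literal above.
Regex r = new Regex(@"\s+");
// Rejoin lines that break right after a comma, then split on the remaining line breaks.
// Note: this relies on the input using "\r\n" line endings.
var sentences = text.Replace(",\r\n", ", ").Split(new string[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
var finalText = string.Join("\r\n", sentences.Select(s => r.Replace(s, " ").Trim()));

Console.WriteLine(text + "\n");
Console.WriteLine(finalText + "\n");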

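You don't really need the regex; I only used it to get rid of the tab and spacing characters that were introduced by hard-coding the HTML in the html variable.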
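
The approach above depends on a line break following a comma and on "\r\n" line endings. As an alternative, here is a minimal sketch that treats each <p> element as one output line and collapses all whitespace inside it. It is not from the original answer: the class name ParagraphText and the method Extract are hypothetical, and it assumes that only <p> elements mark line boundaries, so any text outside a <p> is dropped.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class ParagraphText
{
    // Matches any run of whitespace, including spaces, tabs, and line breaks.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Returns the text of every <p> element, one paragraph per line.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
        {
            return string.Empty;
        }

        var lines = paragraphs
            .Select(p => HtmlEntity.DeEntitize(p.InnerText))   // decode entities such as &amp;
            .Select(t => Whitespace.Replace(t, " ").Trim())    // collapse whitespace within the paragraph
            .Where(t => t.Length > 0);                         // skip empty paragraphs

        return string.Join(Environment.NewLine, lines);
    }
}
```

Calling ParagraphText.Extract with the HTML from the question should give the two lines asked for, regardless of whether the source uses "\n" or "\r\n". If other block elements (for example <div> or <li>) should also start a new line, the XPath expression would need to be widened accordingly.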
