Question

I need to extract text from an HTML file using C#. I'm trying to use HtmlAgilityPack, but I'm seeing some parse errors (tags not closed). I'm using these two options:

        htmlDoc.OptionFixNestedTags = true;
        htmlDoc.OptionAutoCloseOnEnd = true;

Is there a "fix all" type of option? I don't care about the errors; I just want the content, or something close to it.

Answer 1

This may be more of a workaround, but when I once had to extract text from HTML, I used regular expressions:

        // Requires: using System.Text.RegularExpressions;
        // Strip anything that looks like a tag, including tags that span lines.
        result = Regex.Replace(result, @"<(.|\n)*?>", String.Empty);
        // Remove leading and trailing newlines.
        result = Regex.Replace(result, @"^\n*", String.Empty);
        result = Regex.Replace(result, @"\n*$", String.Empty);
        // Collapse the remaining newlines into spaces.
        result = result.Replace("\n", " ");
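For anyone who wants to try this, here is a minimal, self-contained sketch of the same steps wrapped in a method. The class name `HtmlTextSketch` and the method name `StripTags` are made up for illustration, and `Trim('\n')` is swapped in for the two anchor regexes since it should have the same effect. It uses only the .NET base class library, not HtmlAgilityPack.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Hypothetical helper: applies the regex steps from the answer above.
    public static string StripTags(string html)
    {
        // Remove anything that looks like a tag, including tags spanning lines.
        string result = Regex.Replace(html, @"<(.|\n)*?>", string.Empty);
        // Drop leading and trailing newlines (stands in for the two anchor regexes).
        result = result.Trim('\n');
        // Flatten the remaining newlines into spaces.
        return result.Replace("\n", " ");
    }

    public static void Main()
    {
        // Deliberately malformed input: <b> and <div> are never closed.
        string html = "<p>Hello <b>world</p>\n<div>unclosed";
        Console.WriteLine(StripTags(html)); // prints "Hello world unclosed"
    }
}
```

Because this never builds a document tree, unclosed tags cannot cause parse errors. The trade-off is that it is crude: the contents of `<script>` and `<style>` elements are kept as text, entities such as `&amp;` are not decoded, and `\r` characters from Windows line endings are left in place.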
