C# replace HTML tags

Time: 2018-01-25 22:07:12

Tags: c# regex

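Okay, basically I'm creating a console application. It scrapes values from the internet, but I need to get rid of the tags. Right now it shows the matches, but with the tags still in them. I tried to replace them, but I keep failing.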

foreach (Match match in Regex.Matches(data, pattern))
{
    Console.WriteLine(match.Value);
}

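This shows the matches, but I can't figure out how to remove the tags from there. I couldn't get any replacement code to work on the matches. It just doesn't work.

Any ideas?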

1 Answer:

Answer 0: (score: 0)

You can use this snippet to convert HTML to plain text:

// This function converts HTML code to plain text
// Every step is commented to explain it better
// You can change or remove unnecessary parts to suit your needs
// Requires: using System.Text; using System.Text.RegularExpressions;
public string HTMLToText(string HTMLCode)
{
 // Remove new lines since they are not visible in HTML
 HTMLCode = HTMLCode.Replace("\n", " ");

 // Remove tab spaces
 HTMLCode = HTMLCode.Replace("\t", " ");

 // Remove multiple white spaces from HTML
 HTMLCode = Regex.Replace(HTMLCode, "\\s+", " ");

 // Remove HEAD tag
 HTMLCode = Regex.Replace(HTMLCode, "<head.*?</head>", ""
                     , RegexOptions.IgnoreCase | RegexOptions.Singleline);

 // Remove any JavaScript
 HTMLCode = Regex.Replace(HTMLCode, "<script.*?</script>", ""
   , RegexOptions.IgnoreCase | RegexOptions.Singleline);

 // Insert line breaks before <br> and <p> tags (with or without
 // attributes) so paragraphs stay separated once the tags are stripped
 HTMLCode = Regex.Replace(HTMLCode, @"<(br|p)\b", "\n<$1",
   RegexOptions.IgnoreCase);

 // Remove all remaining HTML tags
 HTMLCode = Regex.Replace(HTMLCode, "<[^>]*>", "");

 // Replace special characters like &, <, >, " etc.
 // This runs after tag removal so that escaped text such as &lt;b&gt;
 // is not mistaken for a tag, and &amp; comes last so that &amp;lt;
 // is not decoded twice
 StringBuilder sbHTML = new StringBuilder(HTMLCode);
 // Note: There are many more special characters, these are just the
 // most common. You can add new characters to these arrays if needed
 string[] OldWords = {"&nbsp;", "&quot;", "&lt;", "&gt;",
   "&reg;", "&copy;", "&bull;", "&trade;", "&amp;"};
 string[] NewWords = {" ", "\"", "<", ">", "®", "©", "•", "™", "&"};
 for (int i = 0; i < OldWords.Length; i++)
 {
   sbHTML.Replace(OldWords[i], NewWords[i]);
 }

 // Finally, return the plain text
 return sbHTML.ToString();
}
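
To tie this back to the loop in the question: the same idea can be applied to each match.Value before printing it. The following is only a sketch under assumptions. StripTags is a hypothetical compact helper (not the function above), the data and pattern values are made-up stand-ins for the asker's own, and entity decoding is delegated to System.Net.WebUtility.HtmlDecode rather than a hand-written table.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class Demo
{
    // Hypothetical helper: remove anything that looks like a tag,
    // then decode HTML entities such as &amp; into plain characters.
    public static string StripTags(string html)
    {
        string noTags = Regex.Replace(html, "<[^>]*>", "");
        return WebUtility.HtmlDecode(noTags).Trim();
    }

    public static void Main()
    {
        // Made-up sample input and pattern, standing in for the
        // question's `data` and `pattern` variables.
        string data = "<td><b>Price:</b> 10 &amp; up</td><td><i>Stock</i>: 5</td>";
        string pattern = "<td>.*?</td>";

        foreach (Match match in Regex.Matches(data, pattern))
        {
            // Clean each match instead of printing the raw markup
            Console.WriteLine(StripTags(match.Value));
        }
    }
}
```

With this sample input the loop prints "Price: 10 & up" and "Stock: 5". A regex like this is only adequate for simple, well-formed fragments; it will misbehave on things like a ">" inside an attribute value.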

Or use HtmlAgilityPack, which you should be using anyway in .NET:
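
A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is referenced in the project. The class name AgilityDemo, the helper name HtmlToText and the sample input are made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class AgilityDemo
{
    // Hypothetical helper: parse the markup and return its text content.
    public static string HtmlToText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text nodes of the document;
        // DeEntitize turns entities such as &amp; back into characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        // Made-up sample input
        Console.WriteLine(HtmlToText("<p>Hello <b>world</b> &amp; all</p>"));
    }
}
```

Because the markup is parsed rather than pattern-matched, this copes better with malformed HTML than the regex version. Text inside script or style elements may still end up in InnerText, so those nodes might need to be removed first if the input is a whole page.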