Question

Does anyone have a regular expression that removes the attributes from a body tag?

For example:

<body bgcolor="White" style="font-family:sans-serif;font-size:10pt;">

should return:

<body>

It would also be interesting to see an example that removes only a specific attribute, e.g.:

<body bgcolor="White" style="font-family:sans-serif;font-size:10pt;">

should return:

<body bgcolor="White">

Answer 1

You can't parse XHTML with regex. Take a look at the HTML Agility Pack instead.

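using HtmlAgilityPack;

// Parse the markup into a DOM rather than pattern-matching on the text.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
if (body != null)
{
    // Removes only "style"; body.Attributes.RemoveAll() would clear every attribute.
    body.Attributes.Remove("style");
}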

Answer 2

If you're writing a quick and dirty shell script, and you don't plan on using it much...

s/<body [^>]*>/<body>/

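However, I have to agree with the others that a parser is a better idea. I know that sometimes you have to make do with limited resources, but if you rely on a regex, there's a good chance it will come back to bite you when you least expect it.

And to remove a specific attribute: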

s/\(<body [^>]*\) style="[^>"]*"/\1/

This grabs "<body" along with any attributes that come before "style", removes the "style" attribute, and spits the rest back out.
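To make the "come back to bite you" warning concrete, here is a minimal C# sketch that ports the first sed expression to `Regex.Replace`. The class and method names are made up for illustration. It feeds the pattern one well-behaved tag and one tag whose attribute value contains a `>`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveBodyRegexDemo
{
    // Port of s/<body [^>]*>/<body>/ : drop everything between "<body " and the next '>'.
    public static string StripAll(string html)
    {
        return Regex.Replace(html, "<body [^>]*>", "<body>");
    }

    public static void Main()
    {
        // Typical input: the whole attribute list is removed.
        Console.WriteLine(StripAll("<body bgcolor=\"White\" style=\"font-size:10pt;\">"));
        // prints: <body>

        // A '>' inside a quoted value ends the match early and leaves debris behind.
        Console.WriteLine(StripAll("<body onload=\"if (a > b) init();\">"));
        // prints: <body> b) init();">
    }
}
```

The second call shows the kind of input a regex of this shape cannot handle, because `[^>]*` has no notion of quoting.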

Answer 3

Three ways to do it with regex...

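using System;
using System.Text.RegularExpressions;

string html = "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">";
// 1: lookarounds keep "<body" and ">" and delete whatever sits between them.
string a1 = Regex.Replace(html, @"(?<=<body\b).*?(?=>)", "");
// 2: capture the tag name, then lazily skip ahead to the first ">".
string a2 = Regex.Replace(html, @"<(body)\b.*?>", "<$1>");
// 3: like 2, but the attribute part is optional and cannot run past a ">".
string a3 = Regex.Replace(html, @"<(body)(\s[^>]*)?>", "<$1>");
Console.WriteLine(a1); // <body>
Console.WriteLine(a2); // <body>
Console.WriteLine(a3); // <body>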
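All three patterns above remove every attribute. For the second half of the question (dropping only one attribute), a sketch along the same lines might look like the following. `RemoveBodyAttribute` is a hypothetical helper, not part of any library, and it assumes attribute values are quoted and contain no `>`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyAttributeRegex
{
    // Hypothetical helper: removes one named attribute from <body> tags only.
    public static string RemoveBodyAttribute(string html, string name)
    {
        // One attribute: leading whitespace, the name, '=', then a quoted value.
        string attribute = @"\s+" + Regex.Escape(name) + @"\s*=\s*(""[^""]*""|'[^']*')";

        // Find each opening body tag, then edit only the text of that tag.
        return Regex.Replace(
            html,
            @"<body\b[^>]*>",
            m => Regex.Replace(m.Value, attribute, "", RegexOptions.IgnoreCase),
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string html = "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">";
        Console.WriteLine(RemoveBodyAttribute(html, "style")); // <body bgcolor="White">
    }
}
```

Matching the tag first and editing it in a callback keeps the attribute pattern from touching `style` attributes on other elements. It is still text matching, so the usual caveats about regex and HTML apply.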

Answer 4

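LittleBobbyTables' comment above is correct!

Regex is not the right tool for this. If you read the link LittleBobbyTables posted, you will see it really is true: the answer there shows clearly what its author went through from using the wrong tool for the job, and using regex here will bring you the same strain and stress.

Regex is not duct tape for this kind of work, and it is not the answer to everything, including 42... Use the right tool for the right job.

Instead, take a look at HtmlAgilityPack, which will do the job for you and spare you the stress, tears and blood of death by parsing HTML with regex...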

Answer 5

Here is how to do it in SharpQuery:

string html = "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">";
var sq = SharpQuery.Load(html);
var body = sq.Find("body").Single();
foreach (var a in body.Attributes.ToArray())
    a.Remove();
StringWriter sw = new StringWriter();
body.OwnerDocument.Save(sw);
Console.WriteLine(sw.ToString());

It depends on HtmlAgilityPack and is a beta product... but I wanted to show that you can do it this way.

Answer 6

string pattern = @"<body[^>]*>";
string test = @"<body bgcolor=""White"" style=""font-family:sans-serif;font-size:10pt;"">";
string result = Regex.Replace(test,pattern,"<body>",RegexOptions.IgnoreCase);
Console.WriteLine("{0}",result);
string pattern2 = @"(?<=<body[^>]*)\s*style=""[^""]*""(?=[^>]*>)";
result = Regex.Replace(test, pattern2, "", RegexOptions.IgnoreCase);
Console.WriteLine("{0}",result);

This is in case your project requirements restrict your third-party options (and don't give you time to reinvent a parser).
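If third-party libraries are off the table but the input happens to be well-formed XHTML, the framework's own `System.Xml.Linq` may be a middle ground. The sketch below rests on assumptions that are not in the original answer: the fragment parses as XML, `<body>` is the root element, and no XML namespace is declared. Real-world HTML (unclosed tags, unquoted attributes, entities such as `&nbsp;`) would make `XElement.Parse` throw. The class and method names are illustrative.

```csharp
using System;
using System.Xml.Linq;

public static class XLinqBodyCleaner
{
    // Removes every attribute from the root element of a well-formed fragment.
    public static string StripAll(string xhtml)
    {
        XElement body = XElement.Parse(xhtml);
        body.RemoveAttributes();
        return body.ToString(SaveOptions.DisableFormatting);
    }

    // Removes one named attribute from the root element, if it is present.
    public static string StripOne(string xhtml, string name)
    {
        XElement body = XElement.Parse(xhtml);
        body.Attribute(name)?.Remove();
        return body.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        string xhtml = "<body bgcolor=\"White\" style=\"font-size:10pt;\"><p>Hi</p></body>";
        Console.WriteLine(StripAll(xhtml));          // <body><p>Hi</p></body>
        Console.WriteLine(StripOne(xhtml, "style")); // <body bgcolor="White"><p>Hi</p></body>
    }
}
```

For a full document rather than a bare fragment, the body element would have to be located first (for example through `Descendants`), which is left out here.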

Answer 7

Chunky code that I just got working; I will look at reducing this:

        private static string SimpleHtmlCleanup(string html)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

            //foreach(HtmlNode nodebody in doc.DocumentNode.SelectNodes("//a[@href]"))

            var bodyNodes = doc.DocumentNode.SelectNodes("//body");
            if (bodyNodes != null)
            {
                foreach (HtmlNode nodeBody in bodyNodes)
                {
                    nodeBody.Attributes.Remove("style"); 
                }
            }

            var scriptNodes = doc.DocumentNode.SelectNodes("//script");
            if (scriptNodes != null)
            {
                foreach (HtmlNode nodeScript in scriptNodes)
                {
                    nodeScript.Remove();
                }
            }

            var linkNodes = doc.DocumentNode.SelectNodes("//link");
            if (linkNodes != null)
            {
                foreach (HtmlNode nodeLink in linkNodes)
                {
                    nodeLink.Remove();
                }
            }

            var xmlNodes = doc.DocumentNode.SelectNodes("//xml");
            if (xmlNodes != null)
            {
                foreach (HtmlNode nodeXml in xmlNodes)
                {
                    nodeXml.Remove();
                }
            }

            var styleNodes = doc.DocumentNode.SelectNodes("//style");
            if (styleNodes != null)
            {
                foreach (HtmlNode nodeStyle in styleNodes)
                {
                    nodeStyle.Remove();
                }
            }

            var metaNodes = doc.DocumentNode.SelectNodes("//meta");
            if (metaNodes != null)
            {
                foreach (HtmlNode nodeMeta in metaNodes)
                {
                    nodeMeta.Remove();
                }
            }

            var result = doc.DocumentNode.OuterHtml;

            return result;
        }

Regex to remove body tag attributes (C#)
