Question

I have a string from which I need to remove all HTML and XML. I'm not very good with regular expressions. For the HTML, I found some very useful code:

snippet = Regex.Replace(snippet, "<.*?>", "");

At the moment I'm doing this for the XML:

while (snippet.IndexOf("<xml>") != -1)
{
    int startLoc = snippet.IndexOf("<xml>");
    int endLoc = snippet.IndexOf("</xml>");
    snippet = snippet.Remove(startLoc, (endLoc - startLoc) + 6);
}
while (snippet.IndexOf("<style>") != -1)
{
    int startLoc = snippet.IndexOf("<style>");
    int endLoc = snippet.IndexOf("</style>");
    snippet = snippet.Remove(startLoc, (endLoc - startLoc) + 8);
}
// only required for chrome and IE
// removes - <object  classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D" id="ieooui">
while (snippet.IndexOf("<object") != -1)
{
    int startLoc = snippet.IndexOf("<object");
    int endLoc = snippet.IndexOf("id=\"ieooui\">");
    snippet = snippet.Remove(startLoc, (endLoc - startLoc) + 12);
}
// removes - <object id="ieooui" classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D">
while (snippet.IndexOf("<object") != -1)
{
    int startLoc = snippet.IndexOf("<object");
    int endLoc = snippet.IndexOf("classid=\"clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D\"");
    snippet = snippet.Remove(startLoc, (endLoc - startLoc) + 52);
}

This is very untidy. Could someone please give me a regular expression for the XML, specifically for:

<object id="ieooui" classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D">

and

<object  classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D" id="ieooui">

Many thanks.

Answer 1

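In general, you cannot parse HTML with regular expressions. Well, technically you can, but as you said, it will be "untidy". The task is usually done with a SAX parser, or, even without one, with an HTML/XML tokenizer such as this one: http://www.codeproject.com/KB/recipes/HTML_XML_Scanner.aspx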
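To give a rough idea of what the tokenizer approach looks like, here is a minimal hand-written sketch. It is not the CodeProject scanner linked above and it is not a SAX parser; the class name MarkupStripper, the method Strip, and the DropWithContent set are hypothetical names made up for this illustration. It assumes that no attribute value contains a '>' character, and it ignores comments, CDATA sections and entities.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MarkupStripper
{
    // Elements whose whole content is dropped, not just the tags.
    // Assumption: this set mirrors the <xml> and <style> loops in the question.
    private static readonly HashSet<string> DropWithContent =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "xml", "style" };

    public static string Strip(string input)
    {
        var output = new StringBuilder(input.Length);
        int pos = 0;

        while (pos < input.Length)
        {
            int open = input.IndexOf('<', pos);
            if (open < 0)
            {
                // No more tags: keep the remaining text.
                output.Append(input, pos, input.Length - pos);
                break;
            }

            // Keep the text that precedes the tag.
            output.Append(input, pos, open - pos);

            int close = input.IndexOf('>', open + 1);
            if (close < 0)
            {
                // Unterminated '<': treat the rest as plain text.
                output.Append(input, open, input.Length - open);
                break;
            }

            // The tag itself is never copied to the output.
            string tag = input.Substring(open + 1, close - open - 1);
            pos = close + 1;

            string name = TagName(tag);
            bool isOpening = !tag.StartsWith("/", StringComparison.Ordinal)
                          && !tag.EndsWith("/", StringComparison.Ordinal);

            if (isOpening && DropWithContent.Contains(name))
            {
                // Skip to just past the end tag; if none exists, only the
                // opening tag is dropped and scanning continues as usual.
                int end = input.IndexOf("</" + name, pos, StringComparison.OrdinalIgnoreCase);
                if (end >= 0)
                {
                    int endClose = input.IndexOf('>', end);
                    pos = endClose < 0 ? input.Length : endClose + 1;
                }
            }
        }

        return output.ToString();
    }

    // Returns the element name of a tag body such as "object id=\"x\"" or "/style".
    private static string TagName(string tag)
    {
        int start = tag.StartsWith("/", StringComparison.Ordinal) ? 1 : 0;
        int end = start;
        while (end < tag.Length && !char.IsWhiteSpace(tag[end]) && tag[end] != '/')
        {
            end++;
        }
        return tag.Substring(start, end - start);
    }

    public static void Main()
    {
        string snippet = "<p>Hello</p><style>p { color: red; }</style>" +
            "<object id=\"ieooui\" classid=\"clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D\"> world";
        Console.WriteLine(Strip(snippet)); // prints "Hello world"
    }
}
```

Because the scanner only looks at the tag name, both <object ...> variants from the question are removed regardless of the order of their attributes. For anything beyond simple snippets, a real HTML/XML parser or tokenizer is still the safer choice.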
