Question

Suppose I have the following body of text:

Call me Ishmael. Some years ago- never mind how long precisely- having little 
or no money in my purse, and nothing particular to interest me on shore, I 
thought I would sail about a little and see the watery part of the world. It is  
<?xml version="1.0" encoding="utf-8"?>
<RootElement xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
     xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <ChildElement />
   <ChildElement />
</RootElement>
a way I have of driving off the spleen and regulating the circulation. Whenever  
I find myself growing grim about the mouth; whenever it is a damp, drizzly 
November in my soul;

What regular expression can I use to pull the embedded XML out of the surrounding text?

Note: I can assume that <RootElement> and </RootElement> always have the same name.

Answer 1

If you know that the root element is always <RootElement ...> and that there will never be a nested <RootElement> tag, you can do this:

\<\?xml .+?\</RootElement\>

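This regex lazily matches all text from <?xml through the first </RootElement>. Because the XML in the example spans several lines, the dot must also match newlines, so enable dot-all mode (RegexOptions.Singleline in .NET).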
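For illustration, here is a minimal C# sketch of this approach. The class and method names (FixedRootExtractor, Extract) are hypothetical, and it assumes .NET's Regex with RegexOptions.Singleline so that the dot can cross line breaks; the backslashes before < and > are dropped because those characters need no escaping in .NET.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the answer.
public static class FixedRootExtractor
{
    // Lazily matches from the "<?xml " declaration to the first "</RootElement>".
    // Singleline makes "." match newlines, since the embedded XML spans lines.
    private static readonly Regex Pattern =
        new Regex(@"<\?xml .+?</RootElement>", RegexOptions.Singleline);

    // Returns the embedded XML, or an empty string when nothing matches
    // (Match.Value is empty for a failed match).
    public static string Extract(string text)
    {
        return Pattern.Match(text).Value;
    }
}
```

Calling FixedRootExtractor.Extract(subjectString) on the text from the question should return only the span from <?xml to </RootElement>. As the answer says, this relies on there being no nested <RootElement> tag.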

Answer 2

If the root element is not always called RootElement, you can use

<\?xml[^>]+>\s*<\s*(\w+).*?<\s*/\s*\1>

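together with RegexOptions.Singleline. This takes the first tag name after the opening <?xml ... ?> declaration and captures everything up to the matching closing tag.

In C#: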

resultString = Regex.Match(subjectString, @"<\?xml[^>]+>\s*<\s*(\w+).*?<\s*/\s*\1>", RegexOptions.Singleline).Value;
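The one-liner above returns an empty string when nothing matches. As a hedged sketch, the same pattern can be wrapped so the caller can tell a miss from a hit and also read the captured root name from group 1. The names EmbeddedXml and TryExtract are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper; the names are illustrative, not from the answer.
public static class EmbeddedXml
{
    // Group 1 captures the first tag name after the XML declaration;
    // the backreference \1 then requires a closing tag with that same name.
    private static readonly Regex Pattern = new Regex(
        @"<\?xml[^>]+>\s*<\s*(\w+).*?<\s*/\s*\1>",
        RegexOptions.Singleline);

    // Returns true and fills xml and rootName when an embedded document is found;
    // otherwise returns false and sets both to empty strings.
    public static bool TryExtract(string text, out string xml, out string rootName)
    {
        Match m = Pattern.Match(text);
        xml = m.Success ? m.Value : string.Empty;
        rootName = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }
}
```

Note that \w+ only matches letters, digits, and underscores, so a root name containing a namespace prefix (a colon) or a hyphen would need a wider character class.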
