Regex to capture both tagged and untagged content

Date: 2016-12-19 11:42:56

Tags: c# regex

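What I am trying to do is parse some custom tags out of a string while also getting the untagged content. For example, I have the following string: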

Hello World <Red>This is some red text </Red> This is normal <Blue>This is blue text </Blue>

I have a working regex that captures the tagged content:
<(?<tag>\w*)>(?<text>.*)</\k<tag>>

However, this only returns:

 tag: Red
 text: This is some red text
 tag: Blue
 text: This is blue text

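What I need is to also get matches for the untagged content, so that I end up with 4 matches: the two above, plus "Hello World" and "This is normal".

Is this achievable with a regex?

For example, this is my current function: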

 public static List<FormattedConsole> FormatColour(string input)
    {
        List<FormattedConsole> formatted = new List<FormattedConsole>();
        Regex regex = new Regex("<(?<Tag>\\w+)>(?<Text>.*?)</\\1>", RegexOptions.IgnoreCase
                | RegexOptions.CultureInvariant
                | RegexOptions.IgnorePatternWhitespace
                | RegexOptions.Compiled
        );

        MatchCollection ms = regex.Matches(input);

        foreach (Match match in ms)
        {
            GroupCollection groups = match.Groups;
            FormattedConsole format = new FormattedConsole(groups["Text"].Value, groups["Tag"].Value);
            formatted.Add(format);
        }

        return formatted;
    }

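As described above, this only returns the matches between tags. I also need the text that has no tag.

(By the way, FormattedConsole is just a container holding the text and the colour.)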

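2 answers:

Answer 0: (score: 2)

If you are willing to treat the input as XML (by wrapping it in a root element), you can try a solution like this. We will use LINQ to XML. Try it online: https://dotnetfiddle.net/J4zVMY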

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public class Program
{   
    public static void Main()
    {
        string response = @"Hello World <Red>This is some red text </Red> This is normal <Blue>This is blue text </Blue>";
        response = @"<?xml version='1.0' encoding='utf-8'?><root>"+response+"</root>";
        var doc = XDocument.Parse(response);

        // collect every tagged element, and the plain text before it, into a list of Text
        var colors = new List<Text>();
        foreach (var hashElement in doc.Descendants().Skip(1).Where(node => !node.IsEmpty))
        {
            var text = GetText(hashElement.PreviousNode);
            if (text != null)
                colors.Add(new Text(text));
            colors.Add(new Text(hashElement.Value.Trim(), hashElement.Name.ToString()));
        }

        // handle trailing content
        var lastText = GetText(doc.Descendants().Last().NextNode);
        if (lastText != null)
            colors.Add(new Text(lastText));

        // print
        foreach (var color in colors)
            Console.WriteLine($"{color.Color}: {color.Content}");
    }

    private static string GetText(XNode node)=> (node as XText)?.Value.Trim();

    public class Text
    {
        public string Content { get; set; }
        public string Color { get; set; }

        public Text(string content, string color = "Black")
        {
            Color = color;
            Content = content;
        }
    }
}

Output:

Black: Hello World
Red: This is some red text
Black: This is normal
Blue: This is blue text

Caveat: any help is welcome. My LINQ to XML may be a bit rusty.

Answer 1: (score: 2)

You can try this:

string sentence = "Hello World <Red>This is some red text </Red> This is normal <Blue>This is blue text </Blue>";
string[] matchSegments = Regex.Split(sentence,@"(<\w+>)(.*?)<\/\w+>");
foreach (string value in matchSegments)
{
    if(value.Contains("<") && value.Contains(">"))
        Console.Write(value);
    else
        Console.WriteLine(value);   
}

Output:

Hello World
<Red>This is some red text
 This is normal
<Blue>This is blue text

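For completeness, a regex-only variant that stays close to the function in the question can be sketched as follows. It keeps the asker's pattern (with the numbered backreference written as \k<Tag>) and uses Match.Index and Match.Length to pick up the text that sits between matches. The TagSplitter class, its Split method and the "Black" default tag for untagged text are assumptions made for this illustration (the default is borrowed from Answer 0); FormattedConsole is replaced by a plain KeyValuePair because its definition is not shown. Treat it as a minimal sketch, not a tested replacement: nested tags and unclosed tags are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question or answers.
public static class TagSplitter
{
    // Same pattern as in the question; \k<Tag> is the named form of \1.
    private static readonly Regex TagRegex =
        new Regex(@"<(?<Tag>\w+)>(?<Text>.*?)</\k<Tag>>",
                  RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns (tag, text) pairs in input order; untagged runs get defaultTag (an assumption).
    public static List<KeyValuePair<string, string>> Split(string input, string defaultTag = "Black")
    {
        var segments = new List<KeyValuePair<string, string>>();
        int position = 0; // end of the previous match

        foreach (Match match in TagRegex.Matches(input))
        {
            // Untagged text between the previous match and this one.
            AddPlain(segments, input.Substring(position, match.Index - position), defaultTag);

            segments.Add(new KeyValuePair<string, string>(
                match.Groups["Tag"].Value, match.Groups["Text"].Value.Trim()));

            position = match.Index + match.Length;
        }

        // Untagged text after the last tag, if any.
        AddPlain(segments, input.Substring(position), defaultTag);
        return segments;
    }

    // Adds a plain segment unless it is empty or whitespace only.
    private static void AddPlain(List<KeyValuePair<string, string>> segments, string text, string tag)
    {
        text = text.Trim();
        if (text.Length > 0)
            segments.Add(new KeyValuePair<string, string>(tag, text));
    }
}

public static class Program
{
    public static void Main()
    {
        string input = "Hello World <Red>This is some red text </Red> This is normal <Blue>This is blue text </Blue>";
        foreach (var segment in TagSplitter.Split(input))
            Console.WriteLine($"{segment.Key}: {segment.Value}");
    }
}
```

With the sample string from the question this prints four lines: "Black: Hello World", "Red: This is some red text", "Black: This is normal" and "Blue: This is blue text". Because the untagged text is taken from the gaps between matches rather than from a second alternative in the pattern, a stray "<" in the plain text does not cause anything to be dropped.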