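How to split a string using a regular expression

Asked: 2011-11-14 15:50:56

Tags: c# regex string

I want to split a string into a list or an array.

Input: green,"yellow,green",white,orange,"blue,black"

The split character is a comma (,), but it must ignore commas inside quotes.

The output should be: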

  • green
  • yellow,green
  • white
  • orange
  • blue,black

Thanks.

4 answers:

Answer 0 (score: 12)

This is actually easy enough to do with Match:

        string subjectString = @"green,""yellow,green"",white,orange,""blue,black""";
        Regex regexObj = new Regex(@"(?<="")\b[a-z,]+\b(?="")|[a-z]+", RegexOptions.IgnoreCase);
        Match matchResults = regexObj.Match(subjectString);
        while (matchResults.Success)
        {
            Console.WriteLine("{0}", matchResults.Value);
            // matched text: matchResults.Value
            // match start: matchResults.Index
            // match length: matchResults.Length
            matchResults = matchResults.NextMatch();
        }

Output:

green
yellow,green
white
orange
blue,black
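
Since the question asks for a list or an array rather than console output, the same matching can be collected into a list. The following is only a sketch: the class name RegexSplitter and the method name Split are made up for illustration, and the pattern is the one from this answer, unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (name is an assumption, not from the original answer).
public static class RegexSplitter
{
    // Same pattern as above: a quoted run of letters and commas,
    // or a bare run of letters.
    private static readonly Regex FieldPattern =
        new Regex(@"(?<="")\b[a-z,]+\b(?="")|[a-z]+", RegexOptions.IgnoreCase);

    public static List<string> Split(string input)
    {
        var fields = new List<string>();
        foreach (Match m in FieldPattern.Matches(input))
        {
            fields.Add(m.Value); // the quotes are excluded by the lookarounds
        }
        return fields;
    }
}
```

Calling RegexSplitter.Split(subjectString) on the sample input should give the five fields listed above. Note that the character classes only cover the letters a to z, so a value containing digits or spaces would be broken into pieces; the pattern would need widening for real data.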

Explanation:

@"
             # Match either the regular expression below (attempting the next alternative only if this one fails)
   (?<=         # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
      ""            # Match the character “""” literally
   )
   \b           # Assert position at a word boundary
   [a-z,]       # Match a single character present in the list below
                   # A character in the range between “a” and “z”
                   # The character “,”
      +            # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   \b           # Assert position at a word boundary
   (?=          # Assert that the regex below can be matched, starting at this position (positive lookahead)
      ""            # Match the character “""” literally
   )
|            # Or match regular expression number 2 below (the entire match attempt fails if this one fails to match)
   [a-z]        # Match a single character in the range between “a” and “z”
      +            # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"

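Answer 1 (score: 5)

What you have there is an irregular language. In other words, the meaning of a character depends on the sequence of characters before or after it. As the name implies, regular expressions are meant for parsing regular languages.

What you need is a Tokenizer and a Parser; a good internet search engine should lead you to examples. In fact, because the tokens are just characters, you probably do not even need the Tokenizer.

While you can handle this simple case with a regular expression, it is likely to be very slow. It may also cause problems if the quotes are ever unbalanced, because a regular expression would not detect that error, whereas a parser would.

If you are importing a CSV file, you may want to look at the Microsoft.VisualBasic.FileIO.TextFieldParser class, which can parse CSV files (just add a reference to Microsoft.VisualBasic.dll in your C# project).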
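
As a rough sketch of the TextFieldParser route, the snippet below parses a single line held in a string. The wrapper class CsvLineReader and its ParseLine method are hypothetical names introduced for illustration; the TextFieldParser members used (TextFieldType, SetDelimiters, HasFieldsEnclosedInQuotes, ReadFields) are part of the class itself.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

// Hypothetical wrapper (name is an assumption) around TextFieldParser.
public static class CsvLineReader
{
    public static string[] ParseLine(string line)
    {
        // TextFieldParser accepts any TextReader, so a StringReader works
        // for a single in-memory line; for a file, pass the path instead.
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            // ReadFields returns the fields of the current line with the
            // enclosing quotes already removed.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

For a whole file, the usual pattern is to loop while parser.EndOfData is false and call ReadFields once per line. ReadFields is documented to throw MalformedLineException when a line cannot be parsed, which is how unbalanced quotes would be expected to surface here.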

Another way is to write your own state machine (example below), although this still does not cater for a quote in the middle of a value:

using System;
using System.Text;

namespace Example
{
    class Program
    {
        static void Main(string[] args)
        {
            string subjectString = @"green,""yellow,green"",white,orange,""blue,black""";

            bool inQuote = false;
            StringBuilder currentResult = new StringBuilder();
            foreach (char c in subjectString)
            {
                switch (c)
                {
                    case '\"':
                        inQuote = !inQuote;
                        break;

                    case ',':
                        if (inQuote)
                        {
                            currentResult.Append(c);
                        }
                        else
                        {
                            Console.WriteLine(currentResult);
                            currentResult.Clear();
                        }
                        break;

                    default:
                        currentResult.Append(c);
                        break;
                }
            }
            if (inQuote)
            {
                throw new FormatException("Input string does not have balanced Quote Characters");
            }
            Console.WriteLine(currentResult);
        }
    }
}
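
One possible way to extend the state machine so that a quote can appear inside a value is to adopt the common CSV convention that a doubled quote ("") inside a quoted field stands for one literal quote. That convention is an assumption here, not something the question specifies, and the class name QuoteAwareSplitter is made up for this sketch. It also returns the fields as a list instead of printing them.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical variant of the state machine above (name is an assumption).
public static class QuoteAwareSplitter
{
    public static List<string> Split(string input)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuote = false;

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '"')
            {
                // Assumed convention: inside quotes, "" means one literal quote.
                if (inQuote && i + 1 < input.Length && input[i + 1] == '"')
                {
                    current.Append('"');
                    i++; // skip the second quote of the pair
                }
                else
                {
                    inQuote = !inQuote;
                }
            }
            else if (c == ',' && !inQuote)
            {
                // An unquoted comma ends the current field.
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        if (inQuote)
        {
            throw new FormatException("Input string does not have balanced quote characters");
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```

A quote in the middle of an unquoted value (for example ab"cd) is still treated as a toggle, so this sketch only covers the doubled-quote case.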

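Answer 2 (score: 2)

Someone will shortly come up with an answer that does this with a single regex. I am not that clever, but for the sake of balance, here is a suggestion that does not rely entirely on regex. It is based on the old adage that when you try to solve a problem with a regex, you end up with two problems. :)

Personally, given my lack of regex-fu, I would do one of the following:

  • Use a simple regex-based Replace to escape any commas inside quotes with something else (e.g. "&comma;"). You can then do a simple string.Split() on the result and unescape each item in the resulting array before using it. This is yucky, partly because it handles everything twice, and partly because it still uses regexes. Boooo!
  • Parse it by hand, char by char. Convert the string to a char array, then iterate through it, keeping track of whether you are "inside quotes", and build the result one char at a time.
  • The same as the previous suggestion, but using a csv-parser written by someone on the internet. The example I created below does not pass every test in the csv spec, so it is only a guide to illustrate my point.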
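
The first suggestion in the list above (escape, split, unescape) might look roughly like the sketch below. The class name EscapeAndSplit is invented for illustration, and the code assumes the input never contains the literal text "&comma;" and never has a quote inside a quoted value.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (name is an assumption) for the escape/split/unescape idea.
public static class EscapeAndSplit
{
    private const string CommaToken = "&comma;";

    public static string[] Split(string input)
    {
        // Step 1: inside every "quoted" run, swap commas for the placeholder.
        string escaped = Regex.Replace(
            input,
            "\"[^\"]*\"",
            m => m.Value.Replace(",", CommaToken));

        // Step 2: every remaining comma is a real separator, so split on it.
        // Step 3: strip the surrounding quotes and restore the commas.
        return escaped
            .Split(',')
            .Select(f => f.Trim('"').Replace(CommaToken, ","))
            .ToArray();
    }
}
```

As the bullet says, every character is handled twice, so this trades efficiency for brevity.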

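If written well, the non-regex options have a good chance of performing better, because regexes can be somewhat expensive as they scan the string internally looking for patterns.

Really, I just wanted to point out that you do not have to use regex. :)

Here is a fairly naive implementation of my second suggestion. On my PC it happily parses 1 million 15-column strings in 4.5 seconds.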

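using System.Collections.Generic;
using System.Linq;

// IParser is not shown in the original answer; this minimal definition is an assumption.
public interface IParser
{
    IEnumerable<string> Parse(string line);
}

public class ManualParser : IParser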
{
    public IEnumerable<string> Parse(string line)
    {
        if (string.IsNullOrWhiteSpace(line)) return new List<string>();

        line = line.Trim();

        if (line.Contains(",") == false) return new[] { line.Trim('"') };

        if (line.Contains("\"") == false) return line.Split(',').Select(c => c.Trim());

        bool withinQuotes = false;
        var builder = new List<string>();
        var trimChars = new[] { ' ', '"' };

        int left = 0;
        int right = 0;

        for (right = 0; right < line.Length; right++)
        {
            char c = line[right];

            if (c == '"')
            {
                withinQuotes = !withinQuotes;
                continue;
            }

            if (c == ',' && !withinQuotes)
            {
                builder.Add(line.Substring(left, right - left).Trim(trimChars));
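                // The next field starts after the comma. The loop's own right++
                // steps past the comma, so an opening quote right after it is not skipped.
                left = right + 1;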
            }
        }

        builder.Add(line.Substring(left, right - left).Trim(trimChars));

        return builder;
    }
}

Here are some unit tests:

[TestFixture]
public class ManualParserTests
{
    [Test]
    public void Parse_GivenStringWithNoQuotesAndNoCommas_ShouldReturnThatString()
    {
        // Arrange
        var parser = new ManualParser();

        // Act
        string[] result = parser.Parse("This is my data").ToArray();

        // Assert
        Assert.AreEqual(1, result.Length, "Should only be one column returned");
        Assert.AreEqual("This is my data", result[0], "Incorrect value is returned");
    }

    [Test]
    public void Parse_GivenStringWithNoQuotesAndOneComma_ShouldReturnTwoColumns()
    {
        // Arrange
        var parser = new ManualParser();

        // Act
        string[] result = parser.Parse("This is, my data").ToArray();

        // Assert
        Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
        Assert.AreEqual("This is", result[0], "First value is incorrect");
        Assert.AreEqual("my data", result[1], "Second value is incorrect");
    }

    [Test]
    public void Parse_GivenStringWithQuotesAndNoCommas_ShouldReturnColumnWithoutQuotes()
    {
        // Arrange
        var parser = new ManualParser();

        // Act
        string[] result = parser.Parse("\"This is my data\"").ToArray();

        // Assert
        Assert.AreEqual(1, result.Length, "Should be 1 column returned");
        Assert.AreEqual("This is my data", result[0], "Value is incorrect");
    }

    [Test]
    public void Parse_GivenStringWithQuotesAndCommas_ShouldReturnColumnsWithoutQuotes()
    {
        // Arrange
        var parser = new ManualParser();

        // Act
        string[] result = parser.Parse("\"This is\", my data").ToArray();

        // Assert
        Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
        Assert.AreEqual("This is", result[0], "First value is incorrect");
        Assert.AreEqual("my data", result[1], "Second value is incorrect");
    }

    [Test]
    public void Parse_GivenStringWithQuotesContainingCommasAndCommas_ShouldReturnColumnsWithoutQuotes()
    {
        // Arrange
        var parser = new ManualParser();

        // Act
        string[] result = parser.Parse("\"This, is\", my data").ToArray();

        // Assert
        Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
        Assert.AreEqual("This, is", result[0], "First value is incorrect");
        Assert.AreEqual("my data", result[1], "Second value is incorrect");
    }
}

Here is the sample application I used to test throughput:

class Program
{
    static void Main(string[] args)
    {
        RunTest();
    }

    private static void RunTest()
    {
        var parser = new ManualParser();
        string csv = Properties.Resources.Csv;
        var result = new StringBuilder();
        var s = new Stopwatch();

        for (int test = 0; test < 3; test++)
        {
            int lineCount = 0;

            s.Start();
            for (int i = 0; i < 1000000 / 50; i++)
            {
                foreach (var line in csv.Split(new[] { Environment.NewLine }, StringSplitOptions.None))
                {
                    string cur = line + s.ElapsedTicks.ToString();
                    result.AppendLine(parser.Parse(cur).ToString());
                    lineCount++;
                }
            }
            s.Stop();
            Console.WriteLine("Completed {0} lines in {1}ms", lineCount, s.ElapsedMilliseconds);
            s.Reset();
            result = new StringBuilder();
        }
    }
}

Answer 3 (score: 2)

The format of the string you are trying to split looks like standard CSV. Using a CSV parser would probably be easier and faster.