Question

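I want to use the .NET Regex.Split method to split this input string into an array. It must split on whitespace unless the whitespace is enclosed in quotes.

Input: Here is "my string" it has "six matches"

Expected output:

Here
is
my string
it
has
six matches

What pattern do I need? Do I also need to specify any RegexOptions?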

Answer 1

No options are required.

Regex:

\w+|"[\w\s]*"

C#:

Regex regex = new Regex(@"\w+|""[\w\s]*""");

Or, if you need to exclude the " characters:

    Regex
        .Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""")
        .Cast<Match>()
        .Select(m => m.Groups["match"].Value)
        .ToList()
        .ForEach(s => Console.WriteLine(s));

Answer 2

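Lieven's solution gets most of the way there, and as he states in his comments, it is just a matter of changing the ending to match Bartek's solution. The end result is the following working regex: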

(?<=")\w[\w\s]*(?=")|\w+|"[\w\s]*"

Input: Here is "my string" it has "six matches"

Output:

Here
is
"my string"
it
has
"six matches"

Unfortunately, it includes the quotes. If you instead use the following:

(("((?<token>.*?)(?<!\\)")|(?<token>[\w]+))(\s)*)

and explicitly capture the "token" matches as follows:

    RegexOptions options = RegexOptions.None;
    Regex regex = new Regex( @"((""((?<token>.*?)(?<!\\)"")|(?<token>[\w]+))(\s)*)", options );
    string input = @"   Here is ""my string"" it has   "" six  matches""   ";
    var result = (from Match m in regex.Matches( input ) 
                  where m.Groups[ "token" ].Success
                  select m.Groups[ "token" ].Value).ToList();

    for ( int i = 0; i < result.Count(); i++ )
    {
        Debug.WriteLine( string.Format( "Token[{0}]: '{1}'", i, result[ i ] ) );
    }

Debug output:

Token[0]: 'Here'
Token[1]: 'is'
Token[2]: 'my string'
Token[3]: 'it'
Token[4]: 'has'
Token[5]: ' six  matches'

Answer 3

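The top answer doesn't quite work for me. I was trying to split this sort of string on spaces, but it looks like it splits on the dots ('.') as well.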

"the lib.lib" "another lib".lib

I know the question asks about regular expressions, but I ended up writing a non-regex function to do this:

    /// <summary>
    /// Splits the string passed in by the delimiters passed in.
    /// Quoted sections are not split, and all tokens have whitespace
    /// trimmed from the start and end.
    /// </summary>
    public static List<string> split(string stringToSplit, params char[] delimiters)
    {
        List<string> results = new List<string>();

        bool inQuote = false;
        StringBuilder currentToken = new StringBuilder();
        for (int index = 0; index < stringToSplit.Length; ++index)
        {
            char currentCharacter = stringToSplit[index];
            if (currentCharacter == '"')
            {
                // When we see a ", we need to decide whether we are
                // at the start or end of a quoted section...
                inQuote = !inQuote;
            }
            else if (delimiters.Contains(currentCharacter) && inQuote == false)
            {
                // We've come to the end of a token, so we find the token,
                // trim it and add it to the collection of results...
                string result = currentToken.ToString().Trim();
                if (result != "") results.Add(result);

                // We start a new token...
                currentToken = new StringBuilder();
            }
            else
            {
                // We've got a 'normal' character, so we add it to
                // the current token...
                currentToken.Append(currentCharacter);
            }
        }

        // We've come to the end of the string, so we add the last token...
        string lastResult = currentToken.ToString().Trim();
        if (lastResult != "") results.Add(lastResult);

        return results;
    }

Answer 4

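I was using Bartek Szabat's answer, but I needed to capture more than just "\w" characters in my tokens. To solve the problem, I modified his regex slightly, similar to Grzenio's answer: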

Regular Expression: (?<match>[^\s"]+)|(?<match>"[^"]*")

C# String:          (?<match>[^\\s\"]+)|(?<match>\"[^\"]*\")

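Bartek's code (which originally returned tokens stripped of their enclosing quotes) becomes the following. Note that in this pattern the quotes sit inside the named group, so quoted tokens keep their quotes: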

    Regex
        .Matches(input, "(?<match>[^\\s\"]+)|(?<match>\"[^\"]*\")")
        .Cast<Match>()
        .Select(m => m.Groups["match"].Value)
        .ToList()
        .ForEach(s => Console.WriteLine(s));

Answer 5

I found the regex in this answer quite useful. To make it work in C#, you have to use the MatchCollection class.

//need to escape \s
string pattern = "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'";

MatchCollection parsedStrings = Regex.Matches(line, pattern);

for (int i = 0; i < parsedStrings.Count; i++)
{
    //print parsed strings
    Console.Write(parsedStrings[i].Value + " ");
}
Console.WriteLine();
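
The loop above prints `Value`, which still includes the surrounding quotes. The two capture groups in the same pattern already hold the unquoted text, so they can be read instead. A minimal sketch, assuming a hypothetical helper named `Tokenize` and an illustrative input string, neither of which comes from the original answer:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuoteAwareTokenizer
{
    // Same pattern as above: bare token, "double-quoted", or 'single-quoted'.
    private const string Pattern = "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'";

    // Hypothetical helper: returns tokens with the enclosing quotes removed.
    public static List<string> Tokenize(string line)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(line, Pattern))
        {
            if (m.Groups[1].Success) tokens.Add(m.Groups[1].Value);      // "..." content
            else if (m.Groups[2].Success) tokens.Add(m.Groups[2].Value); // '...' content
            else tokens.Add(m.Value);                                    // bare token
        }
        return tokens;
    }

    public static void Main()
    {
        foreach (string token in Tokenize("one \"two three\" 'four five' six"))
            Console.WriteLine(token);
    }
}
```

Checking `Success` rather than testing for an empty `Value` keeps empty quoted tokens such as `""` in the result.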

Answer 6

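This regex will split based on the case you have given above, although it does not strip the quotes or extra spaces, so you may want to do some post-processing on your strings. It should correctly keep quoted strings together, though.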

"[^"]+"|\s?\w+?\s
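
The post-processing can be as small as two `Trim` calls. A minimal sketch, assuming the pattern is used with `Regex.Matches` and wrapped in a hypothetical `SplitAndClean` helper (the name is an assumption of this example):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PostProcessDemo
{
    // Hypothetical helper: applies the pattern above, then strips the
    // surrounding whitespace and quote characters from each match.
    public static string[] SplitAndClean(string input)
    {
        return Regex.Matches(input, @"""[^""]+""|\s?\w+?\s")
            .Cast<Match>()
            .Select(m => m.Value.Trim().Trim('"'))
            .ToArray();
    }

    public static void Main()
    {
        foreach (string token in SplitAndClean("Here is \"my string\" it has \"six matches\""))
            Console.WriteLine(token);
    }
}
```

One thing to watch: the unquoted branch requires a trailing whitespace character, so an unquoted word at the very end of the input is not matched unless the input ends with whitespace.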

Answer 7

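With a bit of messiness, regular languages can keep track of an even/odd count of quotes, but if your data can include escaped quotes (\"), then you are in real trouble producing or comprehending a regular expression that handles them correctly.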
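
To illustrate how the pattern grows once escapes are allowed, the sketch below treats a backslash followed by any character as part of a quoted token. The pattern and the `Tokenize` name are assumptions of this example, not something given in the thread, and it makes no attempt to handle unterminated quotes:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EscapedQuoteDemo
{
    // Regex: "(?:\\.|[^"\\])*"|[^\s"]+
    // Either a quoted run in which \x counts as one unit, or a bare token.
    private const string Pattern = @"""(?:\\.|[^""\\])*""|[^\s""]+";

    // Hypothetical helper: returns the raw matches, with quotes and
    // backslashes left in place.
    public static string[] Tokenize(string input)
    {
        return Regex.Matches(input, Pattern)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        // The input is: say "a \"quoted\" word" now
        foreach (string token in Tokenize(@"say ""a \""quoted\"" word"" now"))
            Console.WriteLine(token);
    }
}
```

Removing the outer quotes and unescaping the inner ones would be a further post-processing step, which is part of why a hand-written scanner can be easier to follow for this case.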

Answer 8

EDIT: Sorry for my previous post; this is obviously possible.

To handle all the non-alphanumeric characters, you need something like this:

MatchCollection matchCollection = Regex.Matches(input, @"(?<match>[^""\s]+)|\""(?<match>[^""]*)""");
foreach (Match match in matchCollection)
{
    yield return match.Groups["match"].Value;
}

You can make the foreach smarter if you are using .NET > 2.0.
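
With LINQ (available from .NET 3.5), the loop can be written as a single expression. A minimal sketch, assuming the snippet lives in a hypothetical `Tokenize` method (the original does not show the enclosing method):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LinqTokenizer
{
    // Hypothetical wrapper around the pattern above. Both alternatives
    // capture into the same named group, so the quotes are dropped.
    public static IEnumerable<string> Tokenize(string input)
    {
        return Regex.Matches(input, @"(?<match>[^""\s]+)|\""(?<match>[^""]*)""")
            .Cast<Match>()
            .Select(m => m.Groups["match"].Value);
    }

    public static void Main()
    {
        foreach (string token in Tokenize("copy lib.dll \"my file.txt\""))
            Console.WriteLine(token);
    }
}
```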

Answer 9

Shaun,

I believe the following regex should do it:

(?<=")\w[\w\s]*(?=")|\w+

Regards,
Lieven

Answer 10

Take a look at LSteinle's "Split Function that Supports Text Qualifiers" over at Code Project.

Here is the snippet from his project that you are interested in.

using System.Text.RegularExpressions;

public string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
{
    string _Statement = String.Format("{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))", 
                        Regex.Escape(delimiter), Regex.Escape(qualifier));

    RegexOptions _Options = RegexOptions.Compiled | RegexOptions.Multiline;
    if (ignoreCase) _Options = _Options | RegexOptions.IgnoreCase;

    Regex _Expression = new Regex(_Statement, _Options);
    return _Expression.Split(expression);
}

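Just be careful about calling this in a loop, because it creates and compiles the Regex every time it is called. If you need to call it many times, I would look at creating a Regex cache of some kind.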
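
One possible shape for such a cache is sketched below. It assumes a static `ConcurrentDictionary` keyed on the three arguments that determine the pattern; the class name and the key format are assumptions of this example, not part of LSteinle's project:

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

public static class CachedQualifiedSplit
{
    // One compiled Regex per (delimiter, qualifier, ignoreCase) combination.
    private static readonly ConcurrentDictionary<string, Regex> Cache =
        new ConcurrentDictionary<string, Regex>();

    public static string[] Split(string expression, string delimiter,
                                 string qualifier, bool ignoreCase)
    {
        // '\0' separates the key parts; assumes it never occurs in the arguments.
        string key = delimiter + "\0" + qualifier + "\0" + ignoreCase;

        Regex regex = Cache.GetOrAdd(key, _ =>
        {
            // Same pattern as the snippet above.
            string statement = String.Format(
                "{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
                Regex.Escape(delimiter), Regex.Escape(qualifier));

            RegexOptions options = RegexOptions.Compiled | RegexOptions.Multiline;
            if (ignoreCase) options |= RegexOptions.IgnoreCase;
            return new Regex(statement, options);
        });

        return regex.Split(expression);
    }

    public static void Main()
    {
        string input = "Here is \"my string\" it has \"six matches\"";
        foreach (string part in Split(input, " ", "\"", false))
            Console.WriteLine(part);
    }
}
```

As with the original snippet, the qualifiers stay in the returned parts, so the quoted tokens still carry their quotes.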

Regular expression to split on spaces unless in quotes
