Is there a lazy `String.Split` in C#?

Asked: 2015-01-27 19:33:32

Tags: c# string ienumerable lazy-evaluation enumerator

All string.Split methods seem to return an array of strings (string[]).

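I am wondering whether there is a lazy variant that returns an IEnumerable<string>, so that it can be used on large strings (or on an infinite-length IEnumerable<char>) when one is only interested in the first few subsequences, saving computational effort as well as memory. It could also be useful if the string is produced by a device or program (network, terminal, pipe) and is therefore not necessarily fully available right away, so that the first occurrences can already be processed.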

Is there such a method in the .NET Framework?

7 answers:

Answer 0 (score: 4)

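There is no such thing built in. Regex.Matches is lazy, if I interpret the decompiled code correctly. Maybe you can make use of that.

Or you can simply write your own split function.

Actually, you could imagine most string functions generalized to arbitrary sequences. Often even to sequences of T, not just char. The BCL does not emphasize that generalization at all. There is no Enumerable.Subsequence, for example.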
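As a rough illustration of the generalization described above, the following sketch splits an arbitrary `IEnumerable<T>` lazily and feeds it from a `TextReader`, which covers the pipe or network case from the question. The names `SequenceExtensions`, `SplitLazy`, and `ReadChars` are hypothetical and not part of the BCL, and emitting an empty trailing chunk (as `string.Split` does with `StringSplitOptions.None`) is just one possible design choice.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class SequenceExtensions
{
    // Hypothetical helper (not a BCL method): lazily splits any sequence into
    // chunks, consuming the source only as far as the caller enumerates.
    public static IEnumerable<List<T>> SplitLazy<T>(
        this IEnumerable<T> source, Func<T, bool> isSeparator)
    {
        var chunk = new List<T>();
        foreach (var item in source)
        {
            if (isSeparator(item))
            {
                yield return chunk;
                chunk = new List<T>(); // fresh list, so earlier results are not mutated
            }
            else
            {
                chunk.Add(item);
            }
        }

        yield return chunk; // trailing chunk, possibly empty
    }

    // Hypothetical adapter: exposes a TextReader (file, pipe, network stream)
    // as a lazy sequence of characters.
    public static IEnumerable<char> ReadChars(this TextReader reader)
    {
        int c;
        while ((c = reader.Read()) != -1)
            yield return (char)c;
    }
}

public static class SequenceSplitDemo
{
    public static void Main()
    {
        using (var reader = new StringReader("first,second,third"))
        {
            // Only the characters up to and including the first comma are read.
            var head = reader.ReadChars().SplitLazy(c => c == ',').First();
            Console.WriteLine(new string(head.ToArray())); // first
            Console.WriteLine(reader.ReadToEnd());         // second,third
        }
    }
}
```

Because each chunk is a `List<T>`, the caller decides whether and when to turn it into a string; for `char` input, `new string(chunk.ToArray())` does that.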

Answer 1 (score: 4)

You can easily write one:

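using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class StringExtensions
{
    // Note: string already has an instance method Split(params char[]), which wins
    // overload resolution; call this as StringExtensions.Split(...) or rename it.
    public static IEnumerable<string> Split(this string toSplit, params char[] splits)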
    {
        if (string.IsNullOrEmpty(toSplit))
            yield break;

        StringBuilder sb = new StringBuilder();

        foreach (var c in toSplit)
        {
            if (splits.Contains(c))
            {
                yield return sb.ToString();
                sb.Clear();
            }
            else
            {
                sb.Append(c);
            }
        }

        if (sb.Length > 0)
            yield return sb.ToString();
    }
}

Obviously, I haven't tested it for parity with string.Split, but I believe it should work just about the same.

As Servy notes, this doesn't split on strings. That is not as simple, nor as efficient, but it is basically the same pattern.

public static IEnumerable<string> Split(this string toSplit, string[] separators)
{
    if (string.IsNullOrEmpty(toSplit))
        yield break;

    StringBuilder sb = new StringBuilder();
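    foreach (var c in toSplit)
    {
        // Append first, then test; testing before appending drops the current character.
        sb.Append(c);
        var s = sb.ToString();
        var sep = separators.FirstOrDefault(
            i => !string.IsNullOrEmpty(i) && s.EndsWith(i, System.StringComparison.Ordinal));
        if (sep != null)
        {
            // Strip the separator that terminates this piece.
            yield return s.Substring(0, s.Length - sep.Length);
            sb.Clear();
        }
    }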

    if (sb.Length > 0)
        yield return sb.ToString();
}
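Since parity with string.Split was not tested for the char-based version, here is a small sketch of how one might compare the two. `SplitParity`, `LazyCharSplit`, and `Matches` are hypothetical names; `LazyCharSplit` is a renamed copy of the char-based method above, renamed so that it does not lose overload resolution to the built-in instance method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class SplitParity
{
    // Hypothetical renamed copy of the char-based lazy split from this answer.
    public static IEnumerable<string> LazyCharSplit(this string toSplit, params char[] splits)
    {
        if (string.IsNullOrEmpty(toSplit))
            yield break;

        var sb = new StringBuilder();
        foreach (var c in toSplit)
        {
            if (splits.Contains(c))
            {
                yield return sb.ToString();
                sb.Clear();
            }
            else
            {
                sb.Append(c);
            }
        }

        if (sb.Length > 0)
            yield return sb.ToString();
    }

    // True when the lazy version yields exactly what string.Split returns.
    public static bool Matches(string input, params char[] splits)
    {
        return input.LazyCharSplit(splits).SequenceEqual(input.Split(splits));
    }

    public static void Main()
    {
        Console.WriteLine(Matches("a,b,c", ',')); // True
        Console.WriteLine(Matches("a,,b", ','));  // True
        Console.WriteLine(Matches("a,b,", ','));  // False: string.Split adds a trailing ""
        Console.WriteLine(Matches("", ','));      // False: string.Split returns one ""
    }
}
```

So the two agree on interior separators but differ on a trailing separator and on empty input, where string.Split (with StringSplitOptions.None) keeps an empty entry.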

Answer 2 (score: 3)

There is nothing built in, but feel free to rip off my Tokenize method:

/// <summary>
/// Splits a string into tokens.
/// </summary>
/// <param name="s">The string to split.</param>
/// <param name="isSeparator">
/// A function testing if a code point at a position
/// in the input string is a separator.
/// </param>
/// <returns>A sequence of tokens.</returns>
IEnumerable<string> Tokenize(string s, Func<string, int, bool> isSeparator = null)
{
    if (isSeparator == null) isSeparator = (str, i) => !char.IsLetterOrDigit(str, i);

    int startPos = -1;

    for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
    {
        if (!isSeparator(s, i))
        {
            if (startPos == -1) startPos = i;
        }
        else if (startPos != -1)
        {
            yield return s.Substring(startPos, i - startPos);
            startPos = -1;
        }
    }

    if (startPos != -1)
    {
        yield return s.Substring(startPos);
    }
}

Answer 3 (score: 1)

As far as I know, there is no built-in method for this. But that doesn't mean you can't write one. Here is an example to give you an idea:

public static IEnumerable<string> SplitLazy(this string str, params char[] separators)
{
    List<char> temp = new List<char>();
    foreach (var c in str)
    {
        if (separators.Contains(c))
        {
            // Emit the buffered piece, if any; the separator itself is never buffered.
            if (temp.Any())
            {
                yield return new string(temp.ToArray());
                temp.Clear();
            }
        }
        else
        {
            temp.Add(c);
        }
    }
    if(temp.Any()) { yield return new string(temp.ToArray()); }
}

Of course, this doesn't handle all cases and can be improved.

Answer 4 (score: 1)

I wrote this variant, which also supports SplitOptions and count. It behaves the same as string.Split in all the test cases I tried. The nameof operator is specific to C# 6 and can be replaced with "count".

public static class StringExtensions
{
    /// <summary>
    /// Splits a string into substrings that are based on the characters in an array. 
    /// </summary>
    /// <param name="value">The string to split.</param>
    /// <param name="options"><see cref="StringSplitOptions.RemoveEmptyEntries"/> to omit empty array elements from the array returned; or <see cref="StringSplitOptions.None"/> to include empty array elements in the array returned.</param>
    /// <param name="count">The maximum number of substrings to return.</param>
    /// <param name="separator">A character array that delimits the substrings in this string, an empty array that contains no delimiters, or null. </param>
    /// <returns></returns>
    /// <remarks>
    /// Delimiter characters are not included in the elements of the returned array. 
    /// If this instance does not contain any of the characters in separator the returned sequence consists of a single element that contains this instance.
    /// If the separator parameter is null or contains no characters, white-space characters are assumed to be the delimiters. White-space characters are defined by the Unicode standard and return true if they are passed to the <see cref="Char.IsWhiteSpace"/> method.
    /// </remarks>
    public static IEnumerable<string> SplitLazy(this string value, int count = int.MaxValue, StringSplitOptions options = StringSplitOptions.None, params char[] separator)
    {
        if (count <= 0)
        {
            if (count < 0) throw new ArgumentOutOfRangeException(nameof(count), "Count cannot be less than zero.");
            yield break;
        }

        Func<char, bool> predicate = char.IsWhiteSpace;
        if (separator != null && separator.Length != 0)
            predicate = (c) => separator.Contains(c);

        if (string.IsNullOrEmpty(value) || count == 1 || !value.Any(predicate))
        {
            yield return value;
            yield break;
        }

        bool removeEmptyEntries = (options & StringSplitOptions.RemoveEmptyEntries) != 0;
        int ct = 0;
        var sb = new StringBuilder();
        for (int i = 0; i < value.Length; ++i)
        {
            char c = value[i];
            if (!predicate(c))
            {
                sb.Append(c);
            }
            else
            {
                if (sb.Length != 0)
                {
                    yield return sb.ToString();
                    sb.Clear();
                }
                else
                {
                    if (removeEmptyEntries)
                        continue;
                    yield return string.Empty;
                }

                if (++ct >= count - 1)
                {
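                    if (removeEmptyEntries)
                    {
                        // Skip any run of consecutive separators.
                        while (++i < value.Length && predicate(value[i])) { }
                    }
                    else
                    {
                        ++i;
                    }

                    // The last piece is the rest of the string, even if it is a single character.
                    if (i < value.Length)
                    {
                        yield return value.Substring(i);
                    }
                    else if (!removeEmptyEntries)
                    {
                        yield return string.Empty;
                    }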
                    yield break;
                }
            }
        }

        if (sb.Length > 0)
            yield return sb.ToString();
        else if (!removeEmptyEntries && predicate(value[value.Length - 1]))
            yield return string.Empty;
    }

    public static IEnumerable<string> SplitLazy(this string value, params char[] separator)
    {
        return value.SplitLazy(int.MaxValue, StringSplitOptions.None, separator);
    }

    public static IEnumerable<string> SplitLazy(this string value, StringSplitOptions options, params char[] separator)
    {
        return value.SplitLazy(int.MaxValue, options, separator);
    }

    public static IEnumerable<string> SplitLazy(this string value, int count, params char[] separator)
    {
        return value.SplitLazy(count, StringSplitOptions.None, separator);
    }
}

Answer 5 (score: 0)

I wanted the functionality of Regex.Split, but in a lazily evaluated form. The code below simply runs through all Matches in the input string and produces the same results as Regex.Split:

public static IEnumerable<string> Split(string input, string pattern, RegexOptions options = RegexOptions.None)
{
    // Always compile - we expect many executions
    var regex = new Regex(pattern, options | RegexOptions.Compiled);

    int currentSplitStart = 0;
    var match = regex.Match(input);

    while (match.Success)
    {
        yield return input.Substring(currentSplitStart, match.Index - currentSplitStart);

        currentSplitStart = match.Index + match.Length;
        match = match.NextMatch();
    }

    yield return input.Substring(currentSplitStart);
}

Note that using this with the pattern parameter {a=b, c=d, e=f} gives the same results as string.Split().

Answer 6 (score: 0)

A lazy split that does not create temporary strings.

Each chunk of the string is copied with the system (mscorlib) String.Substring call.

public static IEnumerable<string> LazySplit(this string source, StringSplitOptions stringSplitOptions, params string[] separators)
{
    var sourceLen = source.Length;

    bool IsSeparator(int index, string separator)
    {
        var separatorLen = separator.Length;

        if (sourceLen < index + separatorLen)
        {
            return false;
        }

        for (var i = 0; i < separatorLen; i++)
        {
            if (source[index + i] != separator[i])
            {
                return false;
            }
        }

        return true;
    }

    var indexOfStartChunk = 0;

    for (var i = 0; i < source.Length; i++)
    {
        foreach (var separator in separators)
        {
            if (IsSeparator(i, separator))
            {
                if (indexOfStartChunk == i)
                {
                    // Empty entry: emit it only when empty entries are kept.
                    if (stringSplitOptions != StringSplitOptions.RemoveEmptyEntries)
                    {
                        yield return string.Empty;
                    }
                }
                else
                {
                    yield return source.Substring(indexOfStartChunk, i - indexOfStartChunk);
                }

                i += separator.Length;
                indexOfStartChunk = i--;
                break;
            }
        }
    }

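    // Emit the trailing chunk; this also covers input that contains no separator at all.
    if (indexOfStartChunk < sourceLen || stringSplitOptions != StringSplitOptions.RemoveEmptyEntries)
    {
        yield return source.Substring(indexOfStartChunk, sourceLen - indexOfStartChunk);
    }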
}
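If the aim is to avoid allocating substrings until they are actually needed, one option is to yield positions rather than strings. The following is only a minimal sketch under a few assumptions: a single char separator, C# 7 value tuples, and the hypothetical names `RangeSplit` and `SplitRanges`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangeSplit
{
    // Hypothetical helper: yields the (start, length) of each piece, so no
    // substring is allocated until the caller asks for one.
    public static IEnumerable<(int Start, int Length)> SplitRanges(this string source, char separator)
    {
        int start = 0;
        for (int i = 0; i < source.Length; i++)
        {
            if (source[i] == separator)
            {
                yield return (start, i - start);
                start = i + 1;
            }
        }

        // Trailing piece, possibly empty.
        yield return (start, source.Length - start);
    }
}

public static class RangeSplitDemo
{
    public static void Main()
    {
        const string text = "alpha;beta;gamma";

        // Only the second range is materialized as a string.
        var second = text.SplitRanges(';').Skip(1).First();
        Console.WriteLine(text.Substring(second.Start, second.Length)); // beta
    }
}
```

Pieces that the caller skips cost no string allocation at all, at the price of a slightly less convenient return type.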