LINQ - Split a string at a maximum length, but don't chop words apart

Asked: 2010-12-29 17:12:23

Tags: c# linq

I have a simple LINQ extension method...

    public static IEnumerable<string> SplitOnLength(this string input, int length)
    {
        int index = 0;
        while (index < input.Length)
        {
            if (index + length < input.Length)
                yield return input.Substring(index, length);
            else
                yield return input.Substring(index);

            index += length;
        }
    }

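This takes a string and chops it into a collection of strings that do not exceed the given length.

This works well, but I want to go a step further. It chops words in half. I don't need it to understand anything complicated; I just want it to cut a string off 'early' when cutting at `length` would land in the middle of text (basically anything that isn't whitespace).

However, I'm not very good with LINQ, so I was wondering if anyone has an idea of how to go about this. I know what I want to do, but I'm not sure how to approach it.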

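So let's say I have the following text.

This is a sample block of text that I would pass through the string splitter.

If I call the method with a length of 6, I get the following.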

  • This i
  • s a sa
  • mple b
  • lock o
  • f text
  • pass t
  • hrough
  • s
  • tring
  • splitt
  • er.

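I would rather it be smart enough to stop earlier and produce something more like...

  • sample

// Bad example, since single words exceed the maximum length, but in a real scenario the length would be larger, closer to 200.

Can anyone help me?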

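8 answers:

Answer 0 (score: 4)

Answer 1 (score: 2)

Try using String.Split(' ') to turn the single string into an array of individual words. Then iterate through them, building up the longest string (with the spaces re-added) that stays under the limit, append a newline, and yield.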
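The steps described here could be sketched roughly as follows. This is only an illustration, not the answerer's code: the names `WordWrapSketch` and `SplitOnWords` are made up, it treats the limit as inclusive (a chunk may be exactly `maxLength` characters), and it yields each chunk rather than appending a newline.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class WordWrapSketch
{
    // Hypothetical helper, named for this example only.
    public static IEnumerable<string> SplitOnWords(this string input, int maxLength)
    {
        if (input == null) yield break;

        var line = new StringBuilder();
        // RemoveEmptyEntries collapses runs of spaces, so no empty "words" appear.
        foreach (var word in input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // The +1 accounts for the space that would join the word to the current line.
            if (line.Length > 0 && line.Length + 1 + word.Length > maxLength)
            {
                yield return line.ToString();
                line.Clear();
            }

            if (line.Length > 0) line.Append(' ');
            line.Append(word); // a word longer than maxLength is kept whole on its own line
        }

        if (line.Length > 0) yield return line.ToString();
    }
}
```

With this sketch, `"this is a test".SplitOnWords(5)` yields `this`, `is a`, `test`. Words longer than the limit are deliberately left intact, which is a design choice rather than a requirement from the question.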

Answer 2 (score: 2)

I would solve this with a for loop:

var ret1 = str.Split(' ');
var ret2 = new List<string>();
ret2.Add("");
int index = 0;
foreach (var item in ret1)
{
    if (item.Length + 1 + ret2[index].Length <= allowedLength)
    {
        ret2[index] += ' ' + item;
        if (ret2[index].Length >= allowedLength)
        {
            ret2.Add("");
            index++;
        }
    }
    else
    {
        ret2.Add(item);
        index++;
    }
}
return ret2;

At first I thought of Zip, but it isn't a good fit here.

And a deferred-execution version using yield:

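public static IEnumerable<string> SaeedsApproach(this string str, int allowedLength)
{
    var ret1 = str.Split(' ');
    string current = "";
    foreach (var item in ret1)
    {
        if (item.Length + 1 + current.Length <= allowedLength)
        {
            current += ' ' + item;
            if (current.Length >= allowedLength)
            {
                yield return current;
                current = "";
            }
        }
        else
        {
            // The word does not fit: emit the finished chunk and
            // start the next chunk with this word so it is not lost.
            if (current.Length > 0)
                yield return current;
            current = item;
        }
    }
    // Emit whatever is left after the last word.
    if (current.Length > 0)
        yield return current;
}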
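A note on what "deferred execution" means for a method like this: the body of a `yield`-based iterator does not run when the method is called, only when the result is enumerated. The small sketch below (the class and method names are invented for the example) makes that visible with a counter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionSketch
{
    // Counts how many times the iterator body has actually started running.
    public static int BodyRuns;

    // Hypothetical stand-in for any yield-based splitter.
    public static IEnumerable<string> Words(string input)
    {
        BodyRuns++;
        foreach (var word in input.Split(' '))
            yield return word;
    }

    public static void Demo()
    {
        var lazy = Words("a b c");      // nothing has executed yet
        Console.WriteLine(BodyRuns);    // prints 0
        var list = lazy.ToList();       // enumerating runs the body
        Console.WriteLine(BodyRuns);    // prints 1
        Console.WriteLine(list.Count);  // prints 3
    }
}
```

This matters when timing such methods: a loop that calls the method without enumerating the result (for example with `ToList()` or an empty `foreach`) measures only the cost of creating the iterator object.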

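Answer 3 (score: 1)

I'm also shy with LINQ :-), but here is code that works without allocating anything (apart from the output words, of course), removes whitespace, removes empty strings, trims strings, and never breaks a word (that's a design choice). I would be interested to see a full LINQ equivalent: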

public static IEnumerable<string> SplitOnLength(this string input, int length)
{
    if (input == null)
        yield break;

    string chunk;
    int current = 0;
    int lastSep = -1;
    for (int i = 0; i < input.Length; i++)
    {
        if (char.IsSeparator(input[i]))
        {
            lastSep = i;
            continue;
        }

        if ((i - current) >= length)
        {
            if (lastSep < 0) // big first word case
                continue;

            chunk = input.Substring(current, lastSep - current).Trim();
            if (chunk.Length > 0)
                yield return chunk;

            current = lastSep;
        }
    }
    chunk = input.Substring(current).Trim();
    if (chunk.Length > 0)
        yield return chunk;
}
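For what a more LINQ-flavoured version might look like, here is one possible sketch built on `Enumerable.Aggregate`. It is not from any of the answers: the names `LinqWrapSketch` and `SplitOnLengthLinq` are invented, it splits on spaces only (not on every separator character), and unlike the loop above it is eager and allocates a list plus intermediate strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinqWrapSketch
{
    // Hypothetical name; an Aggregate-based sketch, not code from the thread.
    public static IEnumerable<string> SplitOnLengthLinq(this string input, int length)
    {
        return (input ?? string.Empty)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Aggregate(new List<string>(), (lines, word) =>
            {
                int last = lines.Count - 1;
                // Extend the last line if the word plus a joining space still fits.
                if (last >= 0 && lines[last].Length + 1 + word.Length <= length)
                    lines[last] += " " + word;
                else
                    lines.Add(word); // start a new line; overlong words stay whole
                return lines;
            });
    }
}
```

The accumulator is mutated in place, which is not very idiomatic LINQ, so a plain loop is arguably still the clearer choice here.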

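Answer 4 (score: 1)

I had to post an answer because I felt the other answers relied too heavily on indexing and complicated logic. I think my answer is fairly simple.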

public static IEnumerable<string> SplitOnLength(this string input, int length)
{
    var words = input.Split(new [] { " ", }, StringSplitOptions.None);
    var result = words.First();
    foreach (var word in words.Skip(1))
    {
        if (result.Length + word.Length > length)
        {
            yield return result;
            result = word;
        }
        else
        {
            result += " " + word;
        }
    }
    yield return result;
}

The result for the sample string provided in the OP is:

This is
a
sample
block
of text
that I
would
pass
through
the
string
splitter.

Answer 5 (score: 0)

Update

One-liner:

    public static IEnumerable<string> SplitOnLength(this string s, int length)
    {
        return Regex.Split(s, @"(.{0," + length + @"}) ")
            .Where(x => x != string.Empty);
    }

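I profiled this against the accepted answer, using a source of ~9,300 characters (lorem ipsum x4) split at or before 200 characters.

10,000 passes:
  - the loop takes ~4,200 ms
  - mine takes ~1,200 ms

Original answer:

This method shortens each result to avoid breaking words, unless a word exceeds the specified length, in which case the word is broken.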

public static IEnumerable<string> SplitOnLength(this string s, int length)
{
    var pattern = @"^.{0," + length + @"}\W";
    var result = Regex.Match(s, pattern).Groups[0].Value;

    if (result == string.Empty)
    {
        if (s == string.Empty) yield break;
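        // Clamp to the remaining length so a short tail without a trailing non-word character does not throw.
        result = s.Substring(0, Math.Min(length, s.Length));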
    }

    yield return result;

    foreach (var subsequent_result in SplitOnLength(s.Substring(result.Length), length))
    {
        yield return subsequent_result;
    }
}

Answer 6 (score: 0)

public static IEnumerable<string> SplitOnLength(this string source,int maxLength)
{
   //check parameters' validity and then
   int currentIndex = 0;
   while (currentIndex + maxLength < source.Length)
   {
      int prevIndex = currentIndex;
      currentIndex += maxLength;
      while (currentIndex >= 0 && source[currentIndex] != ' ') currentIndex--;
      if (currentIndex <= prevIndex)
             throw new ArgumentException("invalid maxLength");
      yield return source.Substring(prevIndex, currentIndex - prevIndex);
      currentIndex++;
   }
   yield return source.Substring(currentIndex);
}

Test case:

"this is a test".SplitOnLength(5).ToList()
                .ForEach(x => Console.WriteLine("|" + x + "|"));

Output:

|this|
|is a|
|test|
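The method above throws when a single word is longer than maxLength. If breaking such a word is preferable to an exception, a variant along the following lines might work. It is only a sketch: the names `HardBreakSketch` and `SplitOnLengthOrBreak` are invented, and only the space character is treated as a break point.

```csharp
using System;
using System.Collections.Generic;

public static class HardBreakSketch
{
    // Hypothetical name; backs up to the last space, else cuts the word at maxLength.
    public static IEnumerable<string> SplitOnLengthOrBreak(this string source, int maxLength)
    {
        // Note: in an iterator these checks only run once enumeration starts.
        if (source == null) throw new ArgumentNullException("source");
        if (maxLength <= 0) throw new ArgumentOutOfRangeException("maxLength");

        int start = 0;
        while (source.Length - start > maxLength)
        {
            // Search backwards over indices start..start+maxLength for a space.
            int cut = source.LastIndexOf(' ', start + maxLength, maxLength + 1);
            if (cut < 0)
            {
                // No space in the window: the word is too long, so cut it.
                yield return source.Substring(start, maxLength);
                start += maxLength;
            }
            else if (cut == start)
            {
                start++; // skip a leading space instead of emitting an empty chunk
            }
            else
            {
                yield return source.Substring(start, cut - start);
                start = cut + 1; // skip the space we split on
            }
        }
        if (start < source.Length) yield return source.Substring(start);
    }
}
```

For the test case above, `"this is a test".SplitOnLengthOrBreak(5)` gives the same three chunks, while `"abcdefghij kl".SplitOnLengthOrBreak(4)` gives `abcd`, `efgh`, `ij`, `kl` instead of throwing.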

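Answer 7 (score: 0)

OK, I tested the various approaches (Jay's RegEx approach, not the LINQ one). Dan Tao's approach throws an exception when I set maintain words to true, so I skipped it.

Here is what I did: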

List<DifferentTypes> smallStrings = new List<DifferentTypes>();
List<DifferentTypes> mediomStrings = new List<DifferentTypes>();
List<DifferentTypes> largeStrings = new List<DifferentTypes>();

for (int i = 0; i < 10; i++)
{
    string strSmallTest = "This is a small string test for different approachs provided here.";

    smallStrings.Add(Approachs(strSmallTest, "small"));

    string mediomSize = "Any public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe."
                        + "Windows 7, Windows Vista SP1 or later, Windows XP SP3, Windows Server 2008 (Server Core Role not supported), Windows Server 2008 R2 "
                        + "(Server Core Role not supported), Windows Server 2003 SP2"
                        + " .NET Framework does not support all versions of every platform. For a list of the supported versions, see .NET Framework System Requirements. ";
    mediomStrings.Add(Approachs(mediomSize, "Mediom"));

    string largeSize =
        "This is a question that I get very frequently, and I always tried to dodged the bullet, but I get it so much that I feel that I have to provide an answer. Obviously, I am (not so) slightly biased toward NHibernate, so while you read it, please keep it in mind." +
        "EF 4.0 has done a lot to handle the issues that were raised with the previous version of EF. Thinks like transparent lazy loading, POCO classes, code only, etc. EF 4.0 is a much nicer than EF 1.0." +
        "The problem is that it is still a very young product, and the changes that were added only touched the surface. I already talked about some of my problems with the POCO model in EF, so I won’t repeat that, or my reservations with the Code Only model. But basically, the major problem that I have with those two is that there seems to be a wall between what experience of the community and what Microsoft is doing. Both of those features shows much of the same issues that we have run into with NHibernate and Fluent NHibernate. Issues that were addressed and resolved, but show up in the EF implementations." +
        "Nevertheless, even ignoring my reservations about those, there are other indications that NHibernate’s maturity makes itself known. I run into that several times while I was writing the guidance for EF Prof, there are things that you simple can’t do with EF, that are a natural part of NHibernate." +
        "I am not going to try to do a point by point list of the differences, but it is interesting to look where we do find major differences between the capabilities of NHibernate and EF 4.0. Most of the time, it is in the ability to fine tune what the framework is actually doing. Usually, this is there to allow you to gain better performance from the system without sacrificing the benefits of using an OR/M in the first place.";

    largeStrings.Add(Approachs(largeSize, "Large"));

    Console.WriteLine();
}

Console.WriteLine("/////////////////////////");
Console.WriteLine("average small for saeed: {0}", smallStrings.Average(x => x.saeed));
Console.WriteLine("average small for Jay: {0}", smallStrings.Average(x => x.Jay));
Console.WriteLine("average small for Simmon: {0}", smallStrings.Average(x => x.Simmon));

Console.WriteLine("/////////////////////////");
Console.WriteLine("average mediom for saeed: {0}", mediomStrings.Average(x => x.saeed));
Console.WriteLine("average mediom for Jay: {0}", mediomStrings.Average(x => x.Jay));
Console.WriteLine("average mediom for Simmon: {0}", mediomStrings.Average(x => x.Simmon));

Console.WriteLine("/////////////////////////");
Console.WriteLine("average large for saeed: {0}", largeStrings.Average(x => x.saeed));
Console.WriteLine("average large for Jay: {0}", largeStrings.Average(x => x.Jay));
Console.WriteLine("average large for Simmon: {0}", largeStrings.Average(x => x.Simmon));

private static DifferentTypes Approachs(string stringToDecompose, string text2Write)
{
    DifferentTypes differentTypes;
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < 1000; i++)
    {
        var strs = stringToDecompose.SaeedsApproach(10);
        foreach (var item in strs)
        { 
        }
    }
    sw.Stop();
    Console.WriteLine("Saeed's Approach takes {0} millisecond for {1} strings", sw.ElapsedMilliseconds, text2Write);
    differentTypes.saeed = sw.ElapsedMilliseconds;

    sw.Restart();
    for (int i = 0; i < 1000; i++)
    {
        var strs = stringToDecompose.JaysApproach(10);
        foreach (var item in strs)
        {
        }
    }
    sw.Stop();
    Console.WriteLine("Jay's Approach takes {0} millisecond for {1} strings", sw.ElapsedMilliseconds, text2Write);
    differentTypes.Jay = sw.ElapsedMilliseconds;

    sw.Restart();
    for (int i = 0; i < 1000; i++)
    {
        var strs = stringToDecompose.SimmonsApproach(10);
        foreach (var item in strs)
        {
        }
    }
    sw.Stop();
    Console.WriteLine("Simmon's Approach takes {0} millisecond for {1} strings", sw.ElapsedMilliseconds, text2Write);
    differentTypes.Simmon = sw.ElapsedMilliseconds;

    return differentTypes;
}

Results:

average small for saeed: 4.6
average small for Jay: 33.9
average small for Simmon: 5.6

average mediom for saeed: 28.7
average mediom for Jay: 173.9
average mediom for Simmon: 38.7

average large for saeed: 115.3
average large for Jay: 594.2
average large for Simmon: 138.7

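Just test it on your own PC; feel free to edit this to record your test results or to improve the current function. I'm sure that if we test with bigger strings we will see a big difference between my approach and yours.

Edit: I changed my approach to use foreach and yield; see my code above. The results are: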

average small for saeed: 6.5
average small for Jay: 34.5
average small for Simmon: 5.9

average mediom for saeed: 30.6
average mediom for Jay: 157.9
average mediom for Simmon: 35

average large for saeed: 122.4
average large for Jay: 584
average large for Simmon: 157

Here is my (Jay's) test:

class Program
{
    static void Main()
    {
        var s =
            "Lorem ipsum dolor sit amet, consectetuer adipiscing elit, " 
            + "sed diam nonummy nibh euismod tincidunt ut laoreet dolore " 
            + "magna aliquam erat volutpat. Ut wisi enim ad minim veniam, " 
            + "quis nostrud exerci tation ullamcorper suscipit lobortis nisl " 
            + "ut aliquip ex ea commodo consequat. Duis autem " 
            + "vel eum iriure dolor in hendrerit in vulputate velit " 
            + "esse molestie consequat, vel illum dolore eu feugiat " 
            + "nulla facilisis at vero eros et accumsan et iusto " 
            + "odio dignissim qui blandit praesent luptatum zzril delenit augue " 
            + "duis dolore te feugait nulla facilisi. Nam liber tempor " 
            + "cum soluta nobis eleifend option congue nihil imperdiet doming id " 
            + "quod mazim placerat facer possim assum. Typi non habent " 
            + "claritatem insitam; est usus legentis in iis qui facit " 
            + "eorum claritatem. Investigationes demonstraverunt lectores legere me lius quod " 
            + "ii legunt saepius. Claritas est etiam processus dynamicus, " 
            + "qui sequitur mutationem consuetudium lectorum. Mirum est notare quam " 
            + "littera gothica, quam nunc putamus parum claram, anteposuerit " 
            + "litterarum formas humanitatis per seacula quarta decima et quinta decima" 
            + ". Eodem modo typi, qui nunc nobis videntur parum clari" 
            + ", fiant sollemnes in futurum.";
        s += s;
        s += s;
        s += s;

        var watch = new Stopwatch();

        watch.Start();
        for (int i = 1; i <= 10000; i++) s.JaysApproach(200).ToList();
        watch.Stop();

        Console.WriteLine("Jay:   {0}", watch.ElapsedTicks / 10000);

        watch.Reset();

        watch.Start();
        for (int i = 1; i <= 10000; i++) s.SaeedsApproach(200);
        watch.Stop();

        Console.WriteLine("Saeed: {0}", watch.ElapsedTicks / 10000);

        watch.Reset();

        watch.Start();
        for (int i = 1; i <= 10000; i++) s.SimonsApproach(200).ToList();
        watch.Stop();

        Console.WriteLine("Simon: {0}", watch.ElapsedTicks / 10000);

        Console.ReadLine();
    }
}

Results:

4 lorem ipsums (as shown):
    Jay:   317
    Saeed: 1069
    Simon: 599

3 lorems ipsums:
    Jay:   283
    Saeed: 862
    Simon: 465

2 lorem ipsums:
    Jay:   189
    Saeed: 417
    Simon: 236

1 lorem ipsum:
    Jay:   113
    Saeed: 204
    Simon: 118