Minimizing a LINQ string token counter

Asked: 2010-10-28 00:39:47

Tags: c# .net linq string

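This is a follow-up to an answer to an earlier question.

Is there any way to reduce this further, apart from the external String.Split call? The goal is an associative container of {token, count}.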

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";

string[] target = src.Split(new char[] { ' ' });

var results = target.GroupBy(t => new
{
    str = t,
    count = target.Count(sub => sub.Equals(t))
});

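4 Answers:

Answer 0: (score: 4)

The way you have it now will work (to some extent), but it is terribly inefficient. As written, the result is an enumeration of groupings, not the (word, count) pairs you might have expected.

The overload of GroupBy() you are using takes a function that selects the key. You are effectively performing that count calculation for every item in the collection. Without going the route of a regex that ignores punctuation, it should be written like this: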

string src = "for each character in the string, take the rest of the " +
             "string starting from that character " +
             "as a substring; count it if it starts with the target string";

var results = src.Split()               // default split by whitespace
                 .GroupBy(str => str)   // group words by the value
                 .Select(g => new
                              {
                                  str = g.Key,      // the value
                                  count = g.Count() // the count of that value
                              });

// sort the results by the words that were counted
var sortedResults = results.OrderByDescending(p => p.str);
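
To make the difference concrete, here is a minimal sketch that runs both forms on a short input. The class and method names (GroupingDemo, QuestionStyle, AnswerStyle) are made up for illustration and do not come from the original posts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; the names are not from the original posts.
public static class GroupingDemo
{
    // The question's form: the key is an anonymous { str, count } object,
    // and target.Count(...) rescans the whole array for every element.
    // Each result is a grouping that still holds every matching word.
    public static List<string> QuestionStyle(string src)
    {
        string[] target = src.Split(' ');
        return target
            .GroupBy(t => new { str = t, count = target.Count(sub => sub.Equals(t)) })
            .Select(g => $"{g.Key.str}:{g.Key.count} (grouping holds {g.Count()} item(s))")
            .ToList();
    }

    // The answer's form: group by the word itself, then project each
    // grouping to a (word, count) pair in a single pass.
    public static List<string> AnswerStyle(string src)
    {
        return src.Split(' ')
            .GroupBy(str => str)
            .Select(g => $"{g.Key}:{g.Count()}")
            .ToList();
    }

    public static void Main()
    {
        const string sample = "a b a";
        Console.WriteLine(string.Join(", ", QuestionStyle(sample)));
        Console.WriteLine(string.Join(", ", AnswerStyle(sample)));
    }
}
```

Both forms end up with the same counts; the question's form simply does far more work to get there and leaves the pairs buried in the grouping keys.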

Answer 1: (score: 3)

Although it is 3-4 times slower, the Regex approach is arguably more accurate:

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";

var regex=new Regex(@"\w+",RegexOptions.Compiled);
var sw=new Stopwatch();

for (int i = 0; i < 100000; i++)
{
    var dic=regex
        .Matches(src)
        .Cast<Match>()
        .Select(m=>m.Value)
        .GroupBy(s=>s)
        .ToDictionary(g=>g.Key,g=>g.Count());
    if(i==1000)sw.Start(); // start timing only after 1000 warm-up iterations
}
Console.WriteLine(sw.Elapsed);

sw.Reset();

for (int i = 0; i < 100000; i++)
{
    var dic=src
        .Split(' ')
        .GroupBy(s=>s)
        .ToDictionary(g=>g.Key,g=>g.Count());
    if(i==1000)sw.Start(); // same warm-up as above, so the two timings are comparable
}
Console.WriteLine(sw.Elapsed);

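For example, the regex approach won't count "string" and "string," as two separate entries, and it will correctly tokenize "substring" rather than "substring;".

EDIT

Having read your previous question, I realize that my code doesn't quite match your spec. Regardless, it still demonstrates the advantage/cost of using Regex.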
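
The tokenization difference can be checked directly on the sample sentence. The sketch below is an illustration only; the class and method names (TokenizerComparison, CountWithRegex, CountWithSplit) are assumptions and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; the names are not from the original posts.
public static class TokenizerComparison
{
    public const string Src =
        "for each character in the string, take the rest of the " +
        "string starting from that character " +
        "as a substring; count it if it starts with the target string";

    // Tokens are runs of word characters, so punctuation never ends up in a key.
    public static Dictionary<string, int> CountWithRegex(string src)
    {
        return Regex.Matches(src, @"\w+")
            .Cast<Match>()
            .Select(m => m.Value)
            .GroupBy(s => s)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    // Tokens are whatever sits between spaces, punctuation included.
    public static Dictionary<string, int> CountWithSplit(string src)
    {
        return src.Split(' ')
            .GroupBy(s => s)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var byRegex = CountWithRegex(Src);
        var bySplit = CountWithSplit(Src);

        Console.WriteLine(byRegex["string"]);              // 3
        Console.WriteLine(bySplit["string"]);              // 2
        Console.WriteLine(bySplit.ContainsKey("string,")); // True
    }
}
```

With the space-only split, the occurrence followed by a comma is stored under its own key, which is why the plain "string" count drops from 3 to 2.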

Answer 2: (score: 1)

Here's a LINQ version without ToDictionary(), which may add unnecessary overhead depending on your needs...

var dic = src.Split(' ').GroupBy(s => s, (str, g) => new { str, count = g.Count() });

Or in query syntax...

var dic = from str in src.Split(' ')
          group str by str into g
          select new { str = g.Key, count = g.Count() }; // str is out of scope after "into g", so use g.Key

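Answer 3: (score: 1)

Getting rid of String.Split doesn't leave many options. One option is Regex.Matches, as spender demonstrated, and another is Regex.Split (which doesn't give us anything new).

Instead of grouping, you can use either of these approaches: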
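
For completeness, here is a rough sketch of what the Regex.Split variant might look like. It is an illustration under assumptions: the class and method names (RegexSplitDemo, CountTokens) and the \W+ separator pattern are choices made here, not something the original answer specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; the names are not from the original posts.
public static class RegexSplitDemo
{
    // Split on runs of non-word characters, then count as before.
    // Regex.Split can return empty strings when the input starts or
    // ends with a separator, so those are filtered out.
    public static Dictionary<string, int> CountTokens(string src)
    {
        return Regex.Split(src, @"\W+")
            .Where(s => s.Length > 0)
            .GroupBy(s => s)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        foreach (var pair in CountTokens("a, b; a."))
        {
            Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }
}
```

This swaps one split call for another, so the grouping and counting steps are unchanged.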

var target = src.Split(new[] { ' ', ',', ';' }, StringSplitOptions.RemoveEmptyEntries);
var result = target.Distinct()
                   .Select(s => new { Word = s, Count = target.Count(w => w == s) });

// or dictionary approach
var result = target.Distinct()
                   .ToDictionary(s => s, s => target.Count(w => w == s));

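The Distinct call is needed to avoid duplicate items. I went ahead and expanded the set of characters to split on, in order to get actual words without punctuation. Using spender's benchmarking code, I found the first approach to be the fastest.

Going back to the requirement of ordering the results from the earlier question you referenced, you can easily extend the first approach like this: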

var result = target.Distinct()
                   .Select(s => new { Word = s, Count = target.Count(w => w == s) })
                   .OrderByDescending(o => o.Count);

// or in query form

var result = from s in target.Distinct()
             let count = target.Count(w => w == s)
             orderby count descending
             select new { Word = s, Count = count };

EDIT: got rid of the tuples, since anonymous types were close at hand.