Question

Is there any way to trim this down further and avoid the external String.Split call? The goal is an associative container of {token, count}.

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";

string[] target = src.Split(new char[] { ' ' });

var results = target.GroupBy(t => new
{
    str = t,
    count = target.Count(sub => sub.Equals(t))
});

Answer 1

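As you have it right now, it will work (to some extent), but it is terribly inefficient. As written, the result is an enumeration of groupings, not the (word, count) pairs you probably had in mind.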
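To make that concrete, here is a small sketch that inspects what the question's query returns. The shortened input string, the `DescribeGroups` helper, and the `Program` wrapper are assumptions made for illustration and are not part of the question. Each element is an `IGrouping` whose key is the anonymous `{ str, count }` object, so the count lives inside the key while the repeated occurrences are still attached as group members.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    // Hypothetical helper: runs the question's query and describes each grouping.
    public static List<string> DescribeGroups(string[] target)
    {
        // Same shape as the query in the question.
        var results = target.GroupBy(t => new
        {
            str = t,
            count = target.Count(sub => sub.Equals(t)) // rescans the whole array for every element
        });

        // The (word, count) pair is only reachable through g.Key;
        // g itself still enumerates every occurrence of the word.
        return results
            .Select(g => string.Format("{0} -> {1} (group members: {2})",
                                       g.Key.str, g.Key.count, g.Count()))
            .ToList();
    }

    public static void Main()
    {
        // Shortened sample input, assumed for illustration only.
        string[] target = "the cat and the dog".Split(' ');

        foreach (string line in DescribeGroups(target))
        {
            Console.WriteLine(line);
        }
    }
}
```

The grouping for `the` carries a count of 2 in its key and also has two members, which shows that the count is computed once per element and then computed again implicitly by the grouping.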

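The GroupBy() overload you are calling takes a function that selects the key. Your key selector runs for every item in the collection, and each run scans the whole array again with Count(), so the work grows quadratically with the number of tokens. Leaving aside the route of using a regular expression that ignores punctuation, it should be written like this: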

string src = "for each character in the string, take the rest of the " +
             "string starting from that character " +
             "as a substring; count it if it starts with the target string";

var results = src.Split()               // default split by whitespace
                 .GroupBy(str => str)   // group words by the value
                 .Select(g => new
                              {
                                  str = g.Key,      // the value
                                  count = g.Count() // the count of that value
                              });

// sort the results by the words that were counted
var sortedResults = results.OrderByDescending(p => p.str);
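If the goal is literally an associative container, the same grouping can be materialized with `ToDictionary`. The following is a minimal sketch rather than part of the answer above; the `CountTokens` helper name, the shortened input, and the `Program` wrapper are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    // Hypothetical helper: maps each whitespace-separated token to its number of occurrences.
    public static Dictionary<string, int> CountTokens(string src)
    {
        // A null separator array makes Split break on whitespace;
        // RemoveEmptyEntries drops the empty tokens produced by repeated spaces.
        return src.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                  .GroupBy(token => token)
                  .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        Dictionary<string, int> counts = CountTokens("the cat and the dog");

        foreach (KeyValuePair<string, int> pair in counts)
        {
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
        }
    }
}
```

A dictionary gives direct lookup by token (`counts["the"]`), whereas the anonymous-type projection is more convenient when the results are only going to be enumerated or sorted.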

Answer 2

Although it is 3-4 times slower, the Regex approach is arguably more accurate:

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";

var regex=new Regex(@"\w+",RegexOptions.Compiled);
var sw=new Stopwatch();

for (int i = 0; i < 100000; i++)
{
    var dic=regex
        .Matches(src)
        .Cast<Match>()
        .Select(m=>m.Value)
        .GroupBy(s=>s)
        .ToDictionary(g=>g.Key,g=>g.Count());
    if (i == 1000) sw.Start(); // treat the first iterations as warm-up, then start timing
}
Console.WriteLine(sw.Elapsed);

sw.Reset();

for (int i = 0; i < 100000; i++)
{
    var dic=src
        .Split(' ')
        .GroupBy(s=>s)
        .ToDictionary(g=>g.Key,g=>g.Count());
    if (i == 1000) sw.Start(); // treat the first iterations as warm-up, then start timing
}
Console.WriteLine(sw.Elapsed);

For example, the regex approach will not count string and string, as two separate entries, and it will correctly tokenize substring rather than substring;.
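The difference in tokens can be seen side by side with a small sketch. The `SplitTokens` and `RegexTokens` helper names and the shortened input string are assumptions made for illustration; only the `\w+` pattern and the split on a single space come from the code above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public class Program
{
    // Tokens produced by splitting on single spaces (punctuation stays attached).
    public static string[] SplitTokens(string src)
    {
        return src.Split(' ');
    }

    // Tokens produced by matching runs of word characters (punctuation is dropped).
    public static string[] RegexTokens(string src)
    {
        return Regex.Matches(src, @"\w+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Shortened sample input, assumed for illustration only.
        string src = "in the string, take the rest of the string";

        Console.WriteLine(string.Join("|", SplitTokens(src)));
        Console.WriteLine(string.Join("|", RegexTokens(src)));
    }
}
```

With the split version, `string,` and `string` end up as different keys; with the regex version both occurrences fall under `string`.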

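EDIT

I read your previous question and realized that my code doesn't quite match your spec. Regardless, it still demonstrates the advantage/cost of using Regex.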

Answer 3

Here's a LINQ version without ToDictionary(), which, depending on your needs, may add unnecessary overhead...

var dic = src.Split(' ').GroupBy(s => s, (str, g) => new { str, count = g.Count() });

Or in query syntax...

var dic = from str in src.Split(' ')
          group str by str into g
          select new { str = g.Key, count = g.Count() }; // str is out of scope after "into g", so use g.Key

Answer 4

Getting rid of String.Split doesn't leave many options on the table. One option is Regex.Matches, as spender demonstrated, and another is Regex.Split (which doesn't give us anything new).
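A hand-rolled scan is one more way to avoid both String.Split and Regex. It is not proposed anywhere in this thread; the sketch below is only an illustration under the assumption that a token is any run of letters or digits (slightly narrower than `\w`, which also accepts underscores), and `CountWords` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;

public class Program
{
    // Hypothetical helper: counts runs of letters/digits without Split or Regex.
    public static Dictionary<string, int> CountWords(string src)
    {
        var counts = new Dictionary<string, int>();
        int start = -1; // index where the current token began, or -1 when outside a token

        // Iterate one position past the end so the final token is flushed.
        for (int i = 0; i <= src.Length; i++)
        {
            bool inToken = i < src.Length && char.IsLetterOrDigit(src[i]);

            if (inToken && start < 0)
            {
                start = i; // a new token begins here
            }
            else if (!inToken && start >= 0)
            {
                string token = src.Substring(start, i - start);
                int current;
                counts.TryGetValue(token, out current); // current stays 0 if the token is new
                counts[token] = current + 1;
                start = -1;
            }
        }

        return counts;
    }

    public static void Main()
    {
        foreach (var pair in CountWords("the string, take the rest; the end"))
        {
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
        }
    }
}
```

This makes a single pass over the input and builds the dictionary directly, at the cost of more code than any of the LINQ one-liners.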

Instead of grouping, you could use either of these approaches:

var target = src.Split(new[] { ' ', ',', ';' }, StringSplitOptions.RemoveEmptyEntries);
var result = target.Distinct()
                   .Select(s => new { Word = s, Count = target.Count(w => w == s) });

// or dictionary approach
var result = target.Distinct()
                   .ToDictionary(s => s, s => target.Count(w => w == s));

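The Distinct call is needed to avoid duplicate items. I went ahead and expanded the set of characters to split on, in order to get the actual words without punctuation. I found the first approach to be the fastest when using spender's benchmarking code.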

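Going back to the requirement from your earlier referenced question of ordering the results, you could easily extend the first approach like so: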

var result = target.Distinct()
                   .Select(s => new { Word = s, Count = target.Count(w => w == s) })
                   .OrderByDescending(o => o.Count);

// or in query form

var result = from s in target.Distinct()
             let count = target.Count(w => w == s)
             orderby count descending
             select new { Word = s, Count = count };

EDIT: got rid of the Tuple usage, since anonymous types are close at hand.

Minimizing a LINQ string token counter
