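This is a follow-up to the answer to an earlier question.
Is there any way to reduce this further and avoid the external String.Split call? The goal is an associative container of {token, count}.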
string src = "for each character in the string, take the rest of the " +
"string starting from that character " +
"as a substring; count it if it starts with the target string";
string[] target = src.Split(new char[] { ' ' });
var results = target.GroupBy(t => new
{
str = t,
count = target.Count(sub => sub.Equals(t))
});
Answer 0 (score: 4)
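As you have it now, it will work (to some extent), but it is terribly inefficient. As written, the result is an enumeration of groupings, not the (word, count) pairs you might have expected.
The GroupBy() overload you are using takes a function that selects the key, so you are effectively running that Count calculation for every item in the collection. Leaving aside the regular-expression route for ignoring punctuation, it should be written like this: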
string src = "for each character in the string, take the rest of the " +
"string starting from that character " +
"as a substring; count it if it starts with the target string";
var results = src.Split() // default split by whitespace
.GroupBy(str => str) // group words by the value
.Select(g => new
{
str = g.Key, // the value
count = g.Count() // the count of that value
});
// sort the results by the words that were counted
var sortedResults = results.OrderByDescending(p => p.str);
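To make the "groupings, not pairs" point concrete, here is a minimal sketch that runs the question's query on a three-word input and prints what each result actually contains. The class and method names (GroupingShapeDemo, DescribeQuestionQuery) are hypothetical and only exist for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupingShapeDemo
{
    // Hypothetical helper (not from the thread): runs the question's query
    // and describes each resulting grouping as "word count [elements]".
    public static List<string> DescribeQuestionQuery(string src)
    {
        string[] words = src.Split(' ');

        // The anonymous object is used as the grouping KEY, so Count() is
        // evaluated once per input word, and each result is an IGrouping
        // whose elements are the repeated words themselves.
        var groupings = words.GroupBy(t => new
        {
            str = t,
            count = words.Count(sub => sub.Equals(t))
        });

        return groupings
            .Select(g => $"{g.Key.str} {g.Key.count} [{string.Join(",", g)}]")
            .ToList();
    }

    public static void Main()
    {
        foreach (string line in DescribeQuestionQuery("a b a"))
            Console.WriteLine(line);
        // prints:
        // a 2 [a,a]
        // b 1 [b]
    }
}
```

The counts do end up inside the key, which is why the original "works to some extent", but each result still carries the grouped elements, and the count is recomputed for every word in the input.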
Answer 1 (score: 3)
Although it is 3-4 times slower, the Regex approach is arguably more accurate:
string src = "for each character in the string, take the rest of the " +
"string starting from that character " +
"as a substring; count it if it starts with the target string";
var regex=new Regex(@"\w+",RegexOptions.Compiled);
var sw=new Stopwatch();
for (int i = 0; i < 100000; i++)
{
var dic=regex
.Matches(src)
.Cast<Match>()
.Select(m=>m.Value)
.GroupBy(s=>s)
.ToDictionary(g=>g.Key,g=>g.Count());
if(i==1000)sw.Start();
}
Console.WriteLine(sw.Elapsed);
sw.Reset();
for (int i = 0; i < 100000; i++)
{
var dic=src
.Split(' ')
.GroupBy(s=>s)
.ToDictionary(g=>g.Key,g=>g.Count());
if(i==1000)sw.Start();
}
Console.WriteLine(sw.Elapsed);
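For example, the Regex approach won't count string and string, (with the trailing comma) as two separate entries, and it will correctly tokenise substring rather than substring; (with the trailing semicolon).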
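The difference in tokenisation can be seen on a short input. The following is a small sketch, not part of the original benchmark; the class name TokenizeDemo, the helper names and the sample sentence are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizeDemo
{
    // Hypothetical helper: plain split on spaces, punctuation stays attached.
    public static string[] SplitTokens(string s) => s.Split(' ');

    // Hypothetical helper: \w+ matches runs of word characters only,
    // so commas and semicolons never become part of a token.
    public static string[] RegexTokens(string s) =>
        Regex.Matches(s, @"\w+")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        const string sample = "the string, a substring; the string";

        Console.WriteLine(string.Join("|", SplitTokens(sample)));
        // prints: the|string,|a|substring;|the|string

        Console.WriteLine(string.Join("|", RegexTokens(sample)));
        // prints: the|string|a|substring|the|string
    }
}
```

With the split version, grouping would see "string," and "string" as different keys; with the regex version both occurrences fall into the same group.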
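Edit
I read your previous question and realised that my code doesn't quite match your spec. Regardless, it still demonstrates the advantage and the cost of using Regex.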
Answer 2 (score: 1)
Here is a LINQ version without ToDictionary(), which may add unnecessary overhead depending on your needs...
var dic = src.Split(' ').GroupBy(s => s, (str, g) => new { str, count = g.Count() });
Or in query syntax...
var dic = from str in src.Split(' ')
group str by str into g
select new { str = g.Key, count = g.Count() }; // str is out of scope after "into g", so use g.Key
Answer 3 (score: 1)
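Getting rid of String.Split doesn't leave many options on the table. One option is Regex.Matches, as spender demonstrated, and the other is Regex.Split (which doesn't give us anything new).
Instead of grouping, you could use either of these approaches: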
var target = src.Split(new[] { ' ', ',', ';' }, StringSplitOptions.RemoveEmptyEntries);
var result = target.Distinct()
.Select(s => new { Word = s, Count = target.Count(w => w == s) });
// or dictionary approach
var result = target.Distinct()
.ToDictionary(s => s, s => target.Count(w => w == s));
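The Distinct call is needed to avoid duplicate items. I went ahead and expanded the set of characters to split on, so that the result contains actual words without punctuation. Using spender's benchmarking code, I found the first approach to be the fastest.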
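As a rough illustration of why Distinct matters here, the sketch below runs the first approach with and without it. The class DistinctDemo, the method CountRows and the sample input are hypothetical and only serve this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctDemo
{
    // Hypothetical helper: returns "word=count" rows, optionally skipping Distinct.
    public static List<string> CountRows(string src, bool useDistinct)
    {
        var target = src.Split(new[] { ' ', ',', ';' },
                               StringSplitOptions.RemoveEmptyEntries);

        // Without Distinct, every occurrence of a word produces its own row.
        IEnumerable<string> keys = useDistinct ? target.Distinct() : target;

        return keys.Select(s => $"{s}={target.Count(w => w == s)}").ToList();
    }

    public static void Main()
    {
        const string sample = "string, string; rest";

        Console.WriteLine(CountRows(sample, useDistinct: true).Count);  // prints: 2
        Console.WriteLine(CountRows(sample, useDistinct: false).Count); // prints: 3
    }
}
```

For the dictionary variant the effect is stronger: calling ToDictionary on a sequence that still contains repeated words throws an ArgumentException because of the duplicate key.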
Going back to the requirement from your earlier question to order the results, you can easily extend the first approach as follows:
var result = target.Distinct()
.Select(s => new { Word = s, Count = target.Count(w => w == s) })
.OrderByDescending(o => o.Count);
// or in query form
var result = from s in target.Distinct()
let count = target.Count(w => w == s)
orderby count descending
select new { Word = s, Count = count };
EDIT: Got rid of the Tuple usage, since anonymous types were at hand.