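Extracting keywords from text in .NET

Time: 2010-10-27 16:44:08

Tags: .net linq search sorting keyword

I need to count how many times each keyword is repeated in a string and sort the results by highest count. What is the fastest algorithm for this in .NET code?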

4 answers:

Answer 0: (score: 6)

Edit: the following code groups the unique tokens with their counts:

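string[] target = src.Split(new char[] { ' ' });

// Group identical tokens once, then project each group to its token and count.
// (Counting inside the GroupBy key rescans the whole array for every token.)
var results = target.GroupBy(t => t)
    .Select(g => new { str = g.Key, count = g.Count() })
    .OrderByDescending(r => r.count);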

This is finally starting to make more sense to me...

Edit: the following code produces a count associated with each target substring:

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";
string[] target = {"string", "the", "in"};

var results = target.Select((t, index) => new {str = t, 
    count = src.Select((c, i) => src.Substring(i)).
    Count(sub => sub.StartsWith(t))});

The results are now:

+       [0] { str = "string", count = 4 }   <Anonymous Type>
+       [1] { str = "the", count = 4 }  <Anonymous Type>
+       [2] { str = "in", count = 6 }   <Anonymous Type>

The original code follows:

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";
string[] target = {"string", "the", "in"};

var results = target.Select(t => src.Select((c, i) => src.Substring(i)).
    Count(sub => sub.StartsWith(t))).OrderByDescending(t => t);

Thanks to this previous response.

Results from the debugger (extra logic is needed to include the matched strings alongside their counts):

-       results {System.Linq.OrderedEnumerable<int,int>}    
-       Results View    Expanding the Results View will enumerate the IEnumerable   
        [0] 6   int
        [1] 4   int
        [2] 4   int
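
As a rough sketch of that extra logic, the counts can be paired with their target strings and sorted in one pass. This version swaps the Substring/StartsWith scan for a repeated IndexOf search, which avoids allocating one substring per character. The class and method names (SubstringCounter, CountOccurrences, CountAll) are made up for illustration, and ordinal comparison is an assumption rather than something the original code specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SubstringCounter
{
    // Counts occurrences of needle in text, overlapping ones included,
    // using ordinal comparison (an assumption; the original is culture-sensitive).
    public static int CountOccurrences(string text, string needle)
    {
        if (string.IsNullOrEmpty(needle)) return 0;

        int count = 0;
        int index = text.IndexOf(needle, StringComparison.Ordinal);
        while (index >= 0)
        {
            count++;
            // Resume one character past the last hit so overlaps are still found.
            index = text.IndexOf(needle, index + 1, StringComparison.Ordinal);
        }
        return count;
    }

    // Pairs every target with its count, highest count first.
    public static List<KeyValuePair<string, int>> CountAll(string text, IEnumerable<string> targets)
    {
        return targets
            .Select(t => new KeyValuePair<string, int>(t, CountOccurrences(text, t)))
            .OrderByDescending(p => p.Value)
            .ToList();
    }
}
```

Calling SubstringCounter.CountAll(src, target) with the src and target values above should give the same three counts as the debugger output, each paired with its string.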

Answer 1: (score: 4)

Dunno about fastest, but Linq is probably the easiest to understand:

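var myListOfKeywords = new [] {"struct", "public", ...};

// Split on the delimiters, group identical tokens, keep only the known keywords,
// and order the groups by how often each keyword occurs.
var keywordCount = from keyword in myProgramText.Split(new [] {" ", "(", ...}, StringSplitOptions.RemoveEmptyEntries)
   group keyword by keyword into g
   where myListOfKeywords.Contains(g.Key)
   orderby g.Count() descending
   select new { g.Key, Count = g.Count() };

foreach (var element in keywordCount)
   Console.WriteLine("Keyword: {0}, Count: {1}", element.Key, element.Count);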

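You could write this in a non-Linq-y way, but the basic premise is the same: split the string into words, and count the occurrences of each word of interest.

Answer 2: (score: 2)

Simple algorithm: split the string into an array of words, iterate over this array, and store the count of each word in a hash table. Sort by count when done.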
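
A minimal sketch of that algorithm, using Dictionary<string, int> as the hash table. The WordCounter name, the separator list, and the case-insensitive comparison are all assumptions made for illustration; the answer does not specify any of them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCounter
{
    // Splits text into words, tallies each word in a dictionary,
    // then returns the tallies sorted by count, highest first.
    public static List<KeyValuePair<string, int>> CountWords(string text)
    {
        // Assumption: words are compared case-insensitively.
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        // Assumption: this separator set is illustrative, not exhaustive.
        var separators = new[] { ' ', '\t', '\r', '\n', ',', ';', '.', '(', ')' };

        foreach (var word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            int current;
            counts.TryGetValue(word, out current); // current stays 0 for a new word
            counts[word] = current + 1;
        }

        return counts.OrderByDescending(p => p.Value).ToList();
    }
}
```

Counting is a single pass over the words, so the sort at the end is likely to dominate for large inputs. Words with equal counts come back in no particular order, because Dictionary does not guarantee an enumeration order.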

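Answer 3: (score: 1)

You could break the string up into a collection of strings, one per word, and then run a LINQ query over the collection. While I doubt it would be the fastest, it would probably be faster than a regular expression.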
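
For comparison, here is a sketch of the regular-expression alternative that answer alludes to: Regex.Matches does the tokenizing and a LINQ query does the counting. The RegexWordCounter name and the \w+ pattern are assumptions for illustration. Which approach is actually faster would have to be measured on real input; nothing here settles that.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWordCounter
{
    // Assumption: a "word" is any run of word characters (\w+).
    private static readonly Regex WordPattern = new Regex(@"\w+", RegexOptions.Compiled);

    // Tokenizes with the regex, groups identical words,
    // and returns word/count pairs, highest count first.
    public static List<KeyValuePair<string, int>> CountWords(string text)
    {
        return WordPattern.Matches(text)
            .Cast<Match>()
            .GroupBy(m => m.Value)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(p => p.Value)
            .ToList();
    }
}
```

Unlike String.Split, the pattern skips punctuation without needing an explicit delimiter list, which may be worth the extra cost when the input is source code or free-form prose.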