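I have a text file stored in a string variable. The text has been processed so that it contains only lowercase words and spaces. Now suppose I have a static dictionary, which is just a list of specific words, and I want to count how often each dictionary word occurs in the text file. For example: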
Text file:
i love love vb development although i m a total newbie
Dictionary:
love, development, fire, stone
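The output I would like to see is something like the following, listing the dictionary words and their counts. If it makes the coding simpler, it can also list only the dictionary words that appear in the text.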
===========
WORD, COUNT
love, 2
development, 1
fire, 0
stone, 0
============
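Using a regex (e.g. "\w+") I can get all the word matches, but I don't know how to get counts for only those words that are also in the dictionary, so I'm stuck. Efficiency is crucial here, because the dictionary is quite large (~100,000 words) and the text files are not small either (~200 KB each).
I appreciate any help.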
Answer 0 (score: 6)
You can count the words in the string by grouping them and converting the result to a dictionary:
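// Requires: using System; using System.Collections.Generic; using System.Linq;
Dictionary<string, int> count =
    theString.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries) // skip empty entries from repeated spaces
        .GroupBy(s => s)
        .ToDictionary(g => g.Key, g => g.Count());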
Now you can check whether each dictionary word exists in it, and show the count if it does.
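A minimal sketch of that lookup step, assuming the dictionary words are available as a sequence of strings. The names `WordCountDemo`, `CountDictionaryWords` and `dictionaryWords` are hypothetical and not from the question. Words that never occur in the text are reported with a count of 0:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCountDemo
{
    // Counts every word in the text once, then looks up each dictionary word.
    // dictionaryWords is an assumed name for the static word list.
    public static Dictionary<string, int> CountDictionaryWords(
        string text, IEnumerable<string> dictionaryWords)
    {
        Dictionary<string, int> counts = text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(s => s)
            .ToDictionary(g => g.Key, g => g.Count());

        var result = new Dictionary<string, int>();
        foreach (string word in dictionaryWords)
        {
            int n;
            // TryGetValue does a single hash lookup; absent words get 0.
            result[word] = counts.TryGetValue(word, out n) ? n : 0;
        }
        return result;
    }

    public static void Main()
    {
        string text = "i love love vb development although i m a total newbie";
        string[] dictionaryWords = { "love", "development", "fire", "stone" };

        Dictionary<string, int> result = CountDictionaryWords(text, dictionaryWords);

        Console.WriteLine("WORD, COUNT");
        foreach (string word in dictionaryWords)
            Console.WriteLine("{0}, {1}", word, result[word]);
    }
}
```

With this approach the text is scanned once, and each dictionary word then costs one hash lookup, so the total work grows roughly with the text length plus the dictionary size.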
Answer 1 (score: 5)
// file is the text string; split it into words before counting
var dict = new Dictionary<string, int>();
foreach (var word in file.Split(' '))
{
    if (dict.ContainsKey(word))
        dict[word]++;
    else
        dict[word] = 1;
}
Answer 2 (score: 0)
Using Groovy's regex facility, I would do it like this:
def input = """
i love love vb development although i m a total newbie
"""
def dictionary = ["love", "development", "fire", "stone"]
dictionary.each {
    // \b anchors make the pattern match whole words only
    def pattern = ~/\b${it}\b/
    def match = input =~ pattern
    println "${it}-${match.count}"
}
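Answer 3 (score: 0)
Try this. The words variable is obviously your text string. The keywords array is the list of keywords you want to count.
This won't return 0 for dictionary words that are not in the text, but you indicated that this behavior is acceptable. It should give you relatively good performance while meeting your application's requirements.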
string words = "i love love vb development although i m a total newbie";
string[] keywords = new[] { "love", "development", "fire", "stone" };
Regex regex = new Regex("\\w+");
var frequencyList = regex.Matches(words)
    .Cast<Match>()
    .Select(c => c.Value.ToLowerInvariant())
    .Where(c => keywords.Contains(c))
    .GroupBy(c => c)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(g => g.Count)
    .ThenBy(g => g.Word);
// Convert to a dictionary
Dictionary<string, int> dict = frequencyList.ToDictionary(d => d.Word, d => d.Count);
// Or iterate through them as is
foreach (var item in frequencyList)
    Response.Write(String.Format("{0}, {1}", item.Word, item.Count));
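If you want to achieve the same thing without using RegEx, since you indicated that you know everything is lowercase and separated by spaces, you can modify the code above like this: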
string words = "i love love vb development although i m a total newbie";
string[] keywords = new[] { "love", "development", "fire", "stone" };
var frequencyList = words.Split(' ')
    .Where(c => keywords.Contains(c))
    .GroupBy(c => c)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(g => g.Count)
    .ThenBy(g => g.Word);
Dictionary<string, int> dict = frequencyList.ToDictionary(d => d.Word, d => d.Count);
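Since the question mentions a dictionary of roughly 100,000 words, it may be worth noting that `keywords.Contains(c)` on an array is a linear scan for every word in the text. One possible variation, sketched here under the assumption that the keywords fit in memory as a `HashSet<string>`, makes each membership test an average constant-time hash lookup. The class and method names (`KeywordFrequency`, `Count`) are illustrative and not part of the answer above. Like the code above, it only returns keywords that actually occur in the text:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordFrequency
{
    // Same pipeline as above, but membership is tested against a HashSet
    // instead of an array. Names here are hypothetical.
    public static Dictionary<string, int> Count(string words, IEnumerable<string> keywords)
    {
        // Build the set once; each Contains call is then a hash lookup.
        var keywordSet = new HashSet<string>(keywords);

        return words
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => keywordSet.Contains(w))
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        string words = "i love love vb development although i m a total newbie";
        string[] keywords = { "love", "development", "fire", "stone" };

        Dictionary<string, int> dict = Count(words, keywords);

        // Highest count first, ties broken alphabetically.
        foreach (var pair in dict.OrderByDescending(p => p.Value).ThenBy(p => p.Key))
            Console.WriteLine("{0}, {1}", pair.Key, pair.Value);
    }
}
```

For the sample input this prints `love, 2` followed by `development, 1`. Whether the difference matters in practice depends on the real data, so it is worth measuring both versions on a representative file.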