Question

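I have a text file stored in a string variable. The text has been processed so that it contains only lowercase words and spaces. Now, suppose I have a static dictionary, which is just a list of specific words, and I want to count how often each dictionary word occurs in the text file. For example: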

Text file:

i love love vb development although i m a total newbie

Dictionary:

love, development, fire, stone

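The output I would like to see is something like the following, listing the dictionary words and their counts. If it makes the coding simpler, it may also list only the dictionary words that appear in the text.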

===========

WORD, COUNT

love, 2

development, 1

fire, 0

stone, 0

============

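Using a regular expression (e.g. "\w+") I can get all the word matches, but I don't know how to get counts for only the words that are also in the dictionary, so I'm stuck. Efficiency is crucial here, because the dictionary is very large (~100,000 words) and the text files are not small either (~200 KB each).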

I appreciate any help.

Answer 1

You can count the words in the string by grouping them and turning the groups into a dictionary:

Dictionary<string, int> count =
  theString.Split(' ')
  .GroupBy(s => s)
  .ToDictionary(g => g.Key, g => g.Count());

Now you can check whether each of your dictionary words exists as a key in count, and show its count if it does.
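A minimal sketch of that lookup step, assuming the dictionary words are held in an array (the name `dictionaryWords` is made up for illustration, as is the `report` list). It relies on `TryGetValue` leaving the out value at 0 when the key is missing, so words that never occur are still listed:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample inputs from the question; dictionaryWords is a hypothetical name.
string theString = "i love love vb development although i m a total newbie";
string[] dictionaryWords = { "love", "development", "fire", "stone" };

// Count every word in the text once, as in the snippet above.
Dictionary<string, int> count = theString
    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
    .GroupBy(s => s)
    .ToDictionary(g => g.Key, g => g.Count());

// Look up each dictionary word; TryGetValue sets n to 0 when the word is absent.
var report = new List<string> { "WORD, COUNT" };
foreach (string word in dictionaryWords)
{
    count.TryGetValue(word, out int n);
    report.Add($"{word}, {n}");
}

Console.WriteLine(string.Join(Environment.NewLine, report));
```

For the sample text this lists love with 2, development with 1, and fire and stone with 0. Each lookup is a hash lookup, so the cost of the reporting loop grows with the size of the word list rather than with the size of the text.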

Answer 2

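// Assumes file is the text string from the question. It has to be split
// into words first; iterating the string directly would yield characters.
var dict = new Dictionary<string, int>();

foreach (var word in file.Split(' '))
  if (dict.ContainsKey(word))
    dict[word]++;
  else
    dict[word] = 1;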
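The loop above counts every word in the text, not just the dictionary words. One possible variation, sketched here under the assumption that `file` is the text string and that the word list sits in an array (the name `dictionaryWords` is hypothetical), seeds the counter with the dictionary words at 0 and only increments entries that already exist:

```csharp
using System;
using System.Collections.Generic;

// Sample inputs from the question; dictionaryWords is a hypothetical name.
string file = "i love love vb development although i m a total newbie";
string[] dictionaryWords = { "love", "development", "fire", "stone" };

// Seed every dictionary word with a count of 0.
var dict = new Dictionary<string, int>(dictionaryWords.Length);
foreach (string word in dictionaryWords)
    dict[word] = 0;

// Single pass over the text: bump the count only for words already in dict.
foreach (string word in file.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
{
    if (dict.TryGetValue(word, out int n))
        dict[word] = n + 1;
}

foreach (KeyValuePair<string, int> pair in dict)
    Console.WriteLine($"{pair.Key}, {pair.Value}");
```

Words outside the dictionary are skipped after a single hash lookup, and dictionary words that never occur keep their count of 0, which matches the output format asked for in the question.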

Answer 3

Using Groovy's regex facility, I would do it like this:

def input="""
    i love love vb development although i m a total newbie
"""

def dictionary=["love", "development", "fire", "stone"]


dictionary.each{
    // \b anchors the match to whole words, so "love" is not counted inside e.g. "glove"
    def pattern= ~/\b${it}\b/
    def match = input =~ pattern
    println "${it}" + "-"+ match.count
}

Answer 4

Try this. The words variable is obviously your text string. The keywords array is the list of keywords you want to count.

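This won't return 0 for dictionary words that are not in the text, but you said that behavior is acceptable. It should give you relatively good performance while meeting your application's requirements.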

string words = "i love love vb development although i m a total newbie";
string[] keywords = new[] { "love", "development", "fire", "stone" };

Regex regex = new Regex("\\w+");

var frequencyList = regex.Matches(words)
    .Cast<Match>()
    .Select(c => c.Value.ToLowerInvariant())
    .Where(c => keywords.Contains(c))
    .GroupBy(c => c)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(g => g.Count)
    .ThenBy(g => g.Word);

//Convert to a dictionary
Dictionary<string, int> dict = frequencyList.ToDictionary(d => d.Word, d => d.Count);

//Or iterate through them as is
foreach (var item in frequencyList)
    Response.Write(String.Format("{0}, {1}", item.Word, item.Count));

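If you want to achieve the same thing without RegEx, since you said you know everything is lowercase and separated by spaces, you can modify the code above like this: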

string words = "i love love vb development although i m a total newbie";
string[] keywords = new[] { "love", "development", "fire", "stone" };

var frequencyList = words.Split(' ')
    .Where(c => keywords.Contains(c))
    .GroupBy(c => c)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(g => g.Count)
    .ThenBy(g => g.Word);

Dictionary<string, int> dict = frequencyList.ToDictionary(d => d.Word, d => d.Count);
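With a dictionary of around 100,000 words, `keywords.Contains(c)` on an array scans the array for every word in the text. A possible tweak, sketched here with an illustrative `keywordSet` variable that is not part of the code above, is to load the keywords into a `HashSet<string>` once so that each membership test is a hash lookup:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string words = "i love love vb development although i m a total newbie";
string[] keywords = { "love", "development", "fire", "stone" };

// Build the set once; each Contains call is then a hash lookup
// instead of a linear scan over the keywords array.
var keywordSet = new HashSet<string>(keywords, StringComparer.Ordinal);

var frequencyList = words
    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
    .Where(w => keywordSet.Contains(w))
    .GroupBy(w => w)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(x => x.Count)
    .ThenBy(x => x.Word)
    .ToList();

foreach (var item in frequencyList)
    Console.WriteLine($"{item.Word}, {item.Count}");
```

The rest of the query is unchanged apart from `ToList()`, which materializes the result once so it is not recomputed on each enumeration. As in the original, words that do not occur in the text are not listed.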

Counting the frequency of specific words in a text file
