Question

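I have a dictionary of frequency counts, and I want to be able to look up the frequency count of a given word in my dictionary.

For example,

if my input word is 'about', the output should be the count for 'about' in my dictionary, which is 139, so that I can then compute its probability.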

  139 about
  133 according
  163 accusing
  244 actually
  567 afternoon
  175 again
  156 ah
  167 a-ha
  165 ahh

I tried doing this with fopen, but did not get the result I wanted.

fid = fopen('dictionary.txt');
words = textscan(fid, '%s');
fclose(fid);
words = words{1};

I also tried this, but it gave a different result:

countfunction = @(word) nnz(strcmp(word, words));
count = cellfun(countfunction, words);
tally = [words num2cell(count)];
sortrows(tally, 2);

Answer 1

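The problem is that you are running countfunction for every instance of every word in the dictionary, rather than once for each unique word.

Here is how to improve the code step by step: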

words = {'hi' 'hi' 'the' 'hi' 'the' 'a'};
unique_words = unique(words(:));
countfunction = @(word) nnz(strcmp(word, words));
count = cellfun(countfunction, unique_words);
tally = [unique_words, num2cell(count)];
disp(sortrows(tally, 2));
    'a'      [1]
    'the'    [2]
    'hi'     [3]
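
The version above rescans the whole word list once per unique word. As a rough sketch of an alternative (not part of the original answer), the third output of unique can be passed to accumarray so that every word is tallied in a single pass. The explicit sort at the end is only an assumption about how you want the result ordered.

```matlab
words = {'hi' 'hi' 'the' 'hi' 'the' 'a'};

% idx(k) is the position of words{k} within unique_words
[unique_words, ~, idx] = unique(words(:));

% Add 1 to the bin of each word occurrence
count = accumarray(idx(:), 1);

% Pair each unique word with its count, ordered by ascending count
tally = [unique_words, num2cell(count)];
[~, order] = sort(count);
tally = tally(order, :);
disp(tally);
```

This uses only base MATLAB functions, which may matter if no toolbox is available.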

However, I would suggest using grpstats (part of the Statistics and Machine Learning Toolbox) instead:

words = {'hi' 'hi' 'the' 'hi' 'the' 'a'};
[unique_words, count] = grpstats(ones(size(words)), words(:), {'gname', 'numel'});
tally = [unique_words, num2cell(count)];
disp(sortrows(tally, 2));
    'a'      [1]
    'the'    [2]
    'hi'     [3]
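
If the dictionary file already stores one count per word, as in the question's listing, counting again may be unnecessary; it may be enough to read the two columns and look the word up. The following is a minimal sketch under that assumption. The variable names (sample, freq, query) are illustrative, and the sample text is copied from the question so the snippet runs on its own. For a real file, textscan(fid, '%f %s') on a handle from fopen should read the same two columns.

```matlab
% Sample "<count> <word>" lines taken from the question (assumed file layout)
sample = sprintf('%s\n', '139 about', '133 according', '163 accusing', ...
    '244 actually', '567 afternoon', '175 again', '156 ah', '167 a-ha', '165 ahh');

% Column 1 is read as a number, column 2 as a string
data = textscan(sample, '%f %s');
counts = data{1};
words = data{2};

% Map each word to its count for direct lookup by key
freq = containers.Map(words, counts);

query = 'about';               % example input word
if isKey(freq, query)
    n = freq(query);
else
    n = 0;                     % word is not in the dictionary
end

% Relative frequency: count of the word over the total of all counts
p = n / sum(counts);
fprintf('%s: count = %d, probability = %.4f\n', query, n, p);
```

With the sample data, n is 139 and the total count is 1909. Whether the total of all counts is the right denominator depends on how the probability is defined for your model.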
