What is the best way to build a dictionary (word counts) for NLP in MATLAB?

Asked: 2016-10-27 13:01:25

Tags: matlab dictionary nlp

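I have a dictionary of frequency counts, and I want to be able to look up the frequency count of a given word in that dictionary.

For example:

If my input word is 'about', the output should be the count stored for 'about' in my dictionary, which is 139. From that count I can then compute a probability.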

  139 about
  133 according
  163 accusing
  244 actually
  567 afternoon
  175 again
  156 ah
  167 a-ha
  165 ahh

I tried to do this with fopen, but did not get the result I wanted.

fid = fopen('dictionary.txt');
words = textscan(fid, '%s');
fclose(fid);
words = words{1};

I also tried this, but got a different result:

countfunction = @(word) nnz(strcmp(word, words));
count = cellfun(countfunction, words);
tally = [words num2cell(count)];
sortrows(tally, 2);

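1 Answer:

Answer 0: (score: 0)

The problem is that you are running countfunction once for every instance of every word in the dictionary, rather than once for each unique word.

Here is an incremental improvement to your code: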

words = {'hi' 'hi' 'the' 'hi' 'the' 'a'};
unique_words = unique(words(:));
countfunction = @(word) nnz(strcmp(word, words));
count = cellfun(countfunction, unique_words);
tally = [unique_words, num2cell(count)];
disp(sortrows(tally, 2));
    'a'      [1]
    'the'    [2]
    'hi'     [3]
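
The cellfun version above compares every unique word against the whole list, so the work grows with the number of unique words times the number of tokens. As a sketch of one alternative in base MATLAB, the third output of unique can be fed to accumarray so that each token is counted in a single pass. The variable names idx and order are illustrative and do not come from the code above.

```matlab
words = {'hi' 'hi' 'the' 'hi' 'the' 'a'};

% idx(k) is the position of words{k} within unique_words
[unique_words, ~, idx] = unique(words(:));

% Add 1 into the bin of each token, giving one count per unique word
count = accumarray(idx, 1);

% Sort by ascending count and reorder the words to match
[count, order] = sort(count);
unique_words = unique_words(order);

tally = [unique_words, num2cell(count)];
disp(tally);
```

Sorting the numeric count vector first, and only then building the cell array, avoids having to sort a cell array that mixes text and numbers.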

However, I would suggest using grpstats instead (it requires the Statistics and Machine Learning Toolbox):

words = {'hi' 'hi' 'the' 'hi' 'the' 'a'};
[unique_words, count] = grpstats(ones(size(words)), words(:), {'gname', 'numel'});
tally = [unique_words, num2cell(count)];
disp(sortrows(tally, 2));
    'a'      [1]
    'the'    [2]
    'hi'     [3]
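
For the lookup described in the question, where the counts are already stored next to each word, a minimal sketch could parse the two columns and store them in a containers.Map. This assumes dictionary.txt holds one "count word" pair per line, as in the sample shown in the question. To keep the example self-contained, a few of those lines are placed in a character vector named raw, and the names raw, freq, and query are illustrative only.

```matlab
% Stand-in for the contents of dictionary.txt (assumed "count word" per line)
raw = sprintf('%s\n', '139 about', '133 according', '163 accusing', '175 again');

% '%f %s' reads a number followed by a whitespace-delimited word
parsed = textscan(raw, '%f %s');
counts = parsed{1};   % numeric column vector
words  = parsed{2};   % cell array of character vectors

% Map each word to its count
freq = containers.Map(words, counts);

query = 'about';
if isKey(freq, query)
    n = freq(query);
else
    n = 0;            % word not present in the dictionary
end

% Relative frequency of the word among all counted tokens
p = n / sum(counts);
fprintf('%s: count = %d, relative frequency = %.4f\n', query, n, p);
```

To read from the actual file, the same format string should work with a file identifier: open the file with fopen, pass the identifier to textscan(fid, '%f %s'), then call fclose.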