How can I find which word occurs most often in a paragraph? (MATLAB)

Time: 2012-11-27 20:21:53

Tags: string matlab octave

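I have a large paragraph and want to know which word occurs most often in it. Can anyone point me in the right direction? Any examples and explanations would help. Thanks!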

2 answers:

Answer 0 (score: 5)

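Here is a very MATLAB-y way to do it. I tried to name the variables clearly. Play with each line and inspect the results to understand how it works. Workhorse functions: unique and hist.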

% First produce a cell array of words to be analyzed
% (assumes the variable paragraph already holds the text as a character row vector)
paragraph_cleaned_up_whitespace = regexprep(paragraph, '\s', ' ');   % newlines and tabs become spaces
paragraph_cleaned_up = regexprep(paragraph_cleaned_up_whitespace, '[^a-zA-Z0-9 ]', '');   % drop punctuation
words = regexp(strtrim(paragraph_cleaned_up), '\s+', 'split');   % strtrim avoids empty first/last tokens

% Count each distinct word (case-sensitive: 'Is' and 'is' are counted separately)
[unique_words, ~, word_index] = unique(words);
frequency_count = hist(word_index, 1:max(word_index));
[~, sorted_locations] = sort(frequency_count, 'descend');
words_sorted_by_frequency = unique_words(sorted_locations).';
frequency_of_those_words = frequency_count(sorted_locations).';
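
To see what the two workhorse calls return, the following is a minimal sketch on a hand-made word list. The `demo_*` names and the sample words are illustrative assumptions and not part of the answer above.

```matlab
% Illustrative word list (assumed input, already split and cleaned)
demo_words = {'the', 'cat', 'saw', 'the', 'dog', 'the', 'cat'};

% unique returns the sorted distinct words; its third output maps
% every entry of demo_words to its position in demo_unique
[demo_unique, ~, demo_index] = unique(demo_words);
% demo_unique is {'cat', 'dog', 'saw', 'the'}
% demo_index holds 4 1 3 4 2 4 1 (row or column, depending on the version)

% hist with bin centers 1:max(demo_index) counts how often each index occurs
demo_counts = hist(demo_index, 1:max(demo_index));
% demo_counts holds 2 1 1 3

% The largest count identifies the most frequent word
[demo_max_count, demo_position] = max(demo_counts);
demo_most_frequent = demo_unique{demo_position};   % 'the', seen 3 times
```

Sorting `demo_counts` in descending order, as the answer does, gives the full ranking instead of only the top word.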

Answer 1 (score: 2)

Here is a simple solution that should be quite fast.

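example_paragraph = 'This is an example corpus. Is is a verb?';

% Split on single spaces; lowercase first so that 'Is' and 'is' share one vocabulary entry
% (punctuation stays attached, so 'corpus.' and 'verb?' count as their own words)
words = regexp(lower(example_paragraph), ' ', 'split');
vocabulary = unique(words);
n = length(vocabulary);
counts = zeros(n, 1);
% Count the occurrences of each vocabulary entry
for i=1:n
    counts(i) = sum(strcmpi(words, vocabulary{i}));
end

% max returns the highest count and the index of the first word that reaches it
[frequency_of_the_most_frequent_word, idx] = max(counts);
most_frequent_word = vocabulary{idx};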
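
If the loop becomes slow on a large vocabulary, one possible alternative is to let `accumarray` do the counting from the index vector that `unique` returns. This is a sketch of a different technique, not the approach of the answer above, and the `sketch_*` names are made up for illustration.

```matlab
% Same example sentence as above, lowercased so 'Is' and 'is' match
sketch_paragraph = lower('This is an example corpus. Is is a verb?');
sketch_words = regexp(sketch_paragraph, ' ', 'split');

% Map each word to the index of its entry in the sorted vocabulary
[sketch_vocabulary, ~, sketch_index] = unique(sketch_words);

% accumarray adds 1 at each index, which gives one count per vocabulary entry
sketch_counts = accumarray(sketch_index(:), 1);

[sketch_max_count, sketch_position] = max(sketch_counts);
sketch_most_frequent = sketch_vocabulary{sketch_position};   % 'is', seen 3 times
```

This makes a single pass over the word list instead of one string comparison pass per vocabulary entry.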

You can also look at the answer here for getting the most frequent word from an array of words.