Question

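Suppose we want to use MATLAB to count how many times each word occurs in a particular text file. How would we do that? I am checking whether a word is a spam word or a ham word (content filtering), so I want to find the probability that a word is spam: n(occurrences in spam) / n(total occurrences) gives that probability.

Any hints?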

Answer 1

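For example, consider a text file named text.txt containing the following text:

These two sentences, like all sentences, contain words. Some of those words are repeated; but not all.

A possible approach is as follows: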

s = importdata('text.txt'); % import text. Gives a 1x1 cell containing a string
words = regexp([lower(s{1}) '.'], '[\s.,;:''"?!/()-]+', 'split'); % split
% into words. Appending '.' makes sure there is always a final punctuation sign.
% You may want to extend the list of separators (between the brackets);
% keep the hyphen last so it is read as a literal character, not a range.
% "lower" makes the count case insensitive.
words = words(1:end-1); % remove last "word", which will always be empty
[uniqueWords, ~, intLabels] = unique(words); % this is the important part:
% get unique words and an integer label for each one
count = accumarray(intLabels(:), 1).'; % occurrences of each label, as a row

The results are in uniqueWords and count:

uniqueWords = 
    'all'    'are'    'but'    'contain'    'like'    'not'    'of'    'repeated'
    'sentences'    'some'    'these'    'those'    'two'    'words'    

count =
      2    1    1    1    1    1    1    1    2    1    1    1    1    2
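
To connect these counts to the spam probability asked about in the question, here is a minimal sketch. It assumes the spam and ham messages are available as two separate texts; the strings spamText and hamText and all variable names below are hypothetical stand-ins, not part of the original answer. It applies the same split-and-count idea to each text and then forms n(spam occurrences) / n(total occurrences) per word.

```matlab
% Hypothetical example texts standing in for the spam and ham corpora
spamText = 'Win money now. Free money for you!';
hamText  = 'Are you free for lunch? I owe you money.';

% Split into lowercase words; the hyphen is last so it is a literal character
sep = '[\s.,;:''"?!/()-]+';
spamWords = regexp(lower(spamText), sep, 'split');
hamWords  = regexp(lower(hamText),  sep, 'split');
spamWords = spamWords(~cellfun('isempty', spamWords));  % drop empty pieces
hamWords  = hamWords(~cellfun('isempty', hamWords));

% Count every word of the combined vocabulary in each corpus
vocab = unique([spamWords, hamWords]);
[~, spamIdx] = ismember(spamWords, vocab);
[~, hamIdx]  = ismember(hamWords,  vocab);
nSpam = accumarray(spamIdx(:), 1, [numel(vocab) 1]);
nHam  = accumarray(hamIdx(:),  1, [numel(vocab) 1]);

% Estimate P(spam | word) as n(spam occurrences) / n(total occurrences)
pSpam = nSpam ./ (nSpam + nHam);

% Look up a single word
pMoney = pSpam(strcmp(vocab, 'money'));
```

In this toy data "money" occurs twice in the spam text and once in the ham text, so pMoney is 2/3. With real corpora you would probably want some smoothing for rare words, which this sketch does not attempt.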

Answer 2

You can use regular expressions to find the number of occurrences of a word.

For example:

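txt = fileread( fileName );                  % read the whole file as one char vector
matches = regexp( txt, pattern, 'match' );   % cell array with every match
n = numel( matches );                        % number of occurrences

Here pattern is the string you want to search for. It is named pattern rather than string so that it does not shadow MATLAB's built-in string function.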
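
One caveat with a plain search string is that it also matches inside longer words (searching for "free" would count "freedom"). The following is a minimal sketch of a whole-word, case-insensitive count; the sample text and the variable names are hypothetical, and in practice txt would come from fileread as above.

```matlab
txt  = 'Free money! Claim your FREE prize, freedom not included.';  % hypothetical text
word = 'free';                                                      % hypothetical word

% Escape any regex metacharacters in the word, then anchor it with
% the word-boundary operators \< (start of word) and \> (end of word)
pattern = ['\<' regexptranslate('escape', word) '\>'];

% 'match' returns the matched substrings; 'ignorecase' ignores letter case
matches = regexp(txt, pattern, 'match', 'ignorecase');
n = numel(matches);
```

Here n is 2: "Free" and "FREE" are counted, while "freedom" is not.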
