MATLAB - How do I get the number of occurrences of each word in a string?

Time: 2014-08-28 19:43:43

Tags: string matlab

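Suppose we want to use MATLAB to count how many times each word occurs in a particular text file. How would we do that? I am checking whether words are spam words or ham words (for content filtering), so I want to find the probability that a word is spam, i.e. n(occurrences in spam) / n(total occurrences).

Any hints?

2 answers:

Answer 0: (score: 3)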

For example, consider a text file named text.txt that contains the following text:

  

These two sentences, like all sentences, contain words. Some of those words are repeated; but not all.

A possible approach is as follows:

s = importdata('text.txt'); %// import text. Gives a 1x1 cell containing a string
words = regexp([lower(s{1}) '.'], '[\s.,;:''"?!/()-]+', 'split'); %// split 
%// into words. The appended '.' guarantees a final punctuation sign.
%// You may want to extend the list of separators (between the brackets);
%// keep the hyphen last so it is read as a literal, not as a range.
%// "lower" makes the count case-insensitive
words = words(1:end-1); %// remove last "word", which will always be empty
[uniqueWords, ~, intLabels] = unique(words); %// this is the important part:
%// get unique words and an integer label for each one
count = histc(intLabels, 1:numel(uniqueWords)); %// occurrences of each label

The result is in the variables uniqueWords and count:

uniqueWords = 
    'all'    'are'    'but'    'contain'    'like'    'not'    'of'    'repeated'
    'sentences'    'some'    'these'    'those'    'two'    'words'    

count =
      2    1    1    1    1    1    1    1    2    1    1    1    1    2
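
The same counting idea can be extended to the spam probability asked about in the question, n(occurrences in spam) / n(total occurrences). The following is only a sketch under stated assumptions: the two sample strings are made up, and all variable names (spamText, hamText, vocab, pSpam, ...) are hypothetical. It uses accumarray in place of histc to count the labels.

```matlab
% Made-up sample texts; in practice these would come from spam and ham files.
spamText = 'win money now, win big';
hamText  = 'meeting now; bring the money report';

sep = '[\s.,;:''"?!/()-]+';   % separator characters, hyphen last so it is literal
spamWords = regexp(lower(spamText), sep, 'split');
hamWords  = regexp(lower(hamText),  sep, 'split');
spamWords = spamWords(~cellfun('isempty', spamWords));  % drop empty pieces
hamWords  = hamWords(~cellfun('isempty', hamWords));

% Label every word against a vocabulary shared by both texts
[vocab, ~, labels] = unique([spamWords, hamWords]);
isSpam = [true(1, numel(spamWords)), false(1, numel(hamWords))];

nTotal = accumarray(labels(:), 1);                % occurrences in both texts
nSpam  = accumarray(labels(:), double(isSpam(:))); % occurrences in the spam text
pSpam  = nSpam ./ nTotal;                          % n(spam) / n(total), per word
```

Here pSpam(k) is the estimated spam probability of vocab{k}. A word that appears only in the ham text gets 0 and one that appears only in the spam text gets 1, so a real filter would probably need some smoothing on top of this raw ratio.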

Answer 1: (score: 0)

You can use a regular expression to find the number of occurrences of a word.

For example:

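txt = fileread( fileName ); %// read the whole file into one char vector
tokens = regexp( txt, string, 'tokens' ); %// one cell entry per match
n = numel( tokens ); %// number of occurrences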

Here string is the string you want to search for.
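
One caveat with a plain pattern is that it also matches inside longer words and is case sensitive. A minimal sketch of a whole-word, case-insensitive count is shown here; the sample text and the names word, expr and n are assumptions for illustration, not part of the original answer.

```matlab
txt  = 'Free offer! Claim your FREE prize; freedom is not free.';  % made-up sample
word = 'free';                                        % word to count

% Escape any regex metacharacters in the word, then anchor it to
% word boundaries (\< start of word, \> end of word).
expr = ['\<' regexptranslate('escape', word) '\>'];

matches = regexp(txt, expr, 'match', 'ignorecase');   % whole-word matches
n = numel(matches);                                   % number of occurrences
```

For this sample, n is 3: "Free", "FREE" and the final "free" are counted, while "freedom" is not, because the pattern requires a word boundary after the match.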