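Suppose we want to use MATLAB to count how many times each word occurs in a particular text file. How would we do it? I am checking whether a word is a spam word or a ham word (for content filtering), so I want to find the probability that a word is spam, i.e. n(occurrences in spam) / n(total occurrences).
Any hints?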
Answer 0 (score: 3)
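For example, consider a text file called text.txt that contains the following text:
These two sentences, like all sentences, contain words. Some of those words are repeated; but not all.
A possible approach is as follows: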
s = importdata('text.txt'); %// import text. Gives a 1x1 cell containing a string
words = regexp([lower(s{1}) '.'], '[\s\.,;:\-''"?!/()]+', 'split'); %// split
%// into words. Make sure there's always at least a final punctuation sign.
%// You may want to extend the list of separators (between the brackets)
%// I have made this case insensitive using "lower"
words = words(1:end-1); %// remove last "word", which will always be empty
[uniqueWords, ~, intLabels] = unique(words); %// this is the important part:
%// get unique words and an integer label for each one
count = histc(intLabels, 1:numel(uniqueWords)); %// occurrences of each label
The result is in uniqueWords and count:
uniqueWords =
'all' 'are' 'but' 'contain' 'like' 'not' 'of' 'repeated'
'sentences' 'some' 'these' 'those' 'two' 'words'
count =
2 1 1 1 1 1 1 1 2 1 1 1 1 2
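To get from these counts to the ratio asked about in the question, n(occurrences in spam) / n(total occurrences), the same unique-label idea can be applied to words pooled from both classes. The following is only a minimal sketch: the word lists spamWords and hamWords are hypothetical stand-ins for the output of the splitting step above, and accumarray is used here in place of histc.

```matlab
% Hypothetical inputs: words already extracted from spam and ham messages
spamWords = {'free', 'money', 'free', 'offer'};
hamWords  = {'meeting', 'free', 'report'};

% Pool all words and remember which ones came from spam
allWords = [spamWords, hamWords];
isSpam   = [true(1, numel(spamWords)), false(1, numel(hamWords))];

% Integer label per word, as in the approach above
[uniqueWords, ~, intLabels] = unique(allWords);

% Total occurrences and spam occurrences of each unique word
nTotal = accumarray(intLabels(:), 1);
nSpam  = accumarray(intLabels(:), double(isSpam(:)));

% Estimated probability that an occurrence of each word is in spam
pSpam = nSpam ./ nTotal;
```

With these hypothetical lists, 'free' occurs three times in total and twice in spam, so its entry in pSpam is 2/3. Words that appear only in ham get 0 and words that appear only in spam get 1, which a real filter would probably want to smooth.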
Answer 1 (score: 0)
You can use a regular expression to find the number of occurrences of a word.
For example:
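txt = fileread( fileName ); % fileName holds the path to the text file
matches = regexp( txt, word, 'match' ); % one cell entry per match
count = numel( matches ); % number of occurrences
Here word is the string (a regular expression) you want to search for. It is named word rather than string to avoid shadowing MATLAB's built-in string function.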
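One caveat with a plain regular-expression search is that it also matches inside longer words, so searching for all would also hit balls. A possible refinement, sketched below, anchors the pattern at word boundaries and ignores case. The sample text and the variable names txt, word and expr are assumptions made for illustration; in practice txt would come from fileread.

```matlab
% Hypothetical sample text; in practice this could come from fileread
txt  = 'All balls fall. Not all of them, but ALL the same.';
word = 'all';   % hypothetical word to count

% Escape any regex metacharacters in the word, then require
% a word boundary on both sides (\< and \>)
expr = ['\<' regexptranslate('escape', word) '\>'];

% Collect every whole-word match, ignoring case
matches = regexp(txt, expr, 'match', 'ignorecase');
count   = numel(matches);   % whole-word occurrences of the word
```

For this sample text count is 3: 'All', 'all' and 'ALL' match, while 'balls' and 'fall' do not.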