<s> an evolutionary immune network for data clustering </s>
<s> an evolutionary immune network for data clustering </s>
<s> inet an extensible framework for simulating immune network </s>
<s> immunity based systems a survey </s>
<s> a recommender system based on the immune network </s>
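I'm working in MATLAB. These sentences come from a text file. I want to read them line by line, extract each word, and count how often each word occurs. How can I use the "regexp" function to extract the words?
Answer 0 (score: 0)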
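</s><s>
is treated as a single word because you read the whole file at once and split only on spaces, rather than on both newlines and spaces.
Instead, use fgets
to read the file line by line and split each line separately, updating the token counts as you go.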
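The line-by-line approach described above could look roughly like the sketch below. It is only an illustration: the temporary sample file, the variable names, and the use of containers.Map to hold the counts are assumptions of mine, not something from the question. Because each line is split separately on whitespace, a closing </s> and the next line's <s> never end up fused into one token.

```matlab
% Write a small sample file so the sketch is self-contained (hypothetical data
% copied from the question; replace fname with your own file, e.g. 'test.txt').
sampleLines = { ...
    '<s> an evolutionary immune network for data clustering </s>', ...
    '<s> an evolutionary immune network for data clustering </s>', ...
    '<s> inet an extensible framework for simulating immune network </s>', ...
    '<s> immunity based systems a survey </s>', ...
    '<s> a recommender system based on the immune network </s>'};
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', sampleLines{:});
fclose(fid);

% Read the file line by line and count tokens as we go.
counts = containers.Map('KeyType', 'char', 'ValueType', 'double');
fid = fopen(fname, 'r');
if fid == -1
    error('Could not open %s', fname);
end
tline = fgets(fid);                 % fgets returns -1 (not char) at end of file
while ischar(tline)
    % '\S+' matches runs of non-whitespace, so the trailing newline is ignored
    tokens = regexp(tline, '\S+', 'match');
    for k = 1:numel(tokens)
        t = tokens{k};
        if isKey(counts, t)
            counts(t) = counts(t) + 1;
        else
            counts(t) = 1;
        end
    end
    tline = fgets(fid);
end
fclose(fid);
delete(fname);                      % remove the temporary sample file
```

After the loop, keys(counts) lists the distinct tokens and values(counts) gives their frequencies in the same order.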
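Answer 1 (score: 0)
I assume the string '<s></s>'
really does appear somewhere in the text file. If so, splitting on spaces is not enough; you have to match every '<s>'
, every '</s>'
, or any run of consecutive word characters: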
regexp(F, '<s>|\w*|</s>', 'match');
Full code:
% Read file contents
fid = fopen('test.txt','r');
F = fread(fid, '*char').';
fclose(fid);
% Split all words
C = regexp(F, '<s>|\w*|</s>', 'match');
% Find word frequencies
words = unique(C);
counts = cellfun(@(x)sum(strcmp(x,C)), words);
% Group them together for display
freq = [num2cell(counts.') words.']
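As a possible variation on the counting step, the third output of unique combined with accumarray tallies all words in one pass instead of calling strcmp once per unique word. This is just a sketch: the token list C below is a made-up example standing in for the output of regexp, and sorting by descending count is my own addition.

```matlab
% Example tokens (hypothetical; in practice C comes from the regexp call)
C = {'<s>', 'an', 'immune', 'network', '</s>', '<s>', 'an', 'inet', '</s>'};

% idx(k) is the position of C{k} within the sorted list of unique words
[words, ~, idx] = unique(C);

% Count how many times each index occurs
counts = accumarray(idx(:), 1);

% Sort from most to least frequent
[counts, order] = sort(counts, 'descend');
words = words(order);

% Group counts and words together for display
freq = [num2cell(counts(:)) words(:)]
```

For small files the cellfun version is perfectly fine; the difference only matters when the number of distinct words gets large.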