Question

<s> an evolutionary immune network for data clustering </s>
<s> an evolutionary immune network for data clustering </s>
<s> inet an extensible framework for simulating immune network </s>
<s> immunity based systems a survey </s>
<s> a recommender system based on the immune network </s>

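I am working in MATLAB. These sentences come from a text file. I want to read them line by line, extract each word, and count how often each word occurs. How can I use the "regexp" function to extract the words?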

Answer 1

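The reason </s><s> is treated as a single word is that you read the whole file at once and split only on spaces, rather than on both newlines and spaces.

Instead, read the file line by line with fgets and split each line separately, incrementing the token counts as you go.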
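
The line-by-line approach described above could look roughly like the following. This is a minimal sketch, not the answerer's own code: the temporary file, the variable names, the use of containers.Map for the counts, and the choice to strip the <s> and </s> markers before matching are all assumptions made for illustration.

```matlab
% Write two of the sample sentences to a temporary file so the sketch
% is self-contained (in practice you would open your own file instead).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', ...
    '<s> an evolutionary immune network for data clustering </s>', ...
    '<s> a recommender system based on the immune network </s>');
fclose(fid);

% Map from word to running count (assumed data structure).
counts = containers.Map('KeyType', 'char', 'ValueType', 'double');

fid = fopen(fname, 'r');
tline = fgets(fid);                          % returns -1 at end of file
while ischar(tline)
    tline  = regexprep(tline, '</?s>', '');  % drop the sentence markers
    tokens = regexp(tline, '\w+', 'match');  % one cell per word on this line
    for k = 1:numel(tokens)
        w = tokens{k};
        if isKey(counts, w)
            counts(w) = counts(w) + 1;       % seen before: increment
        else
            counts(w) = 1;                   % first occurrence
        end
    end
    tline = fgets(fid);
end
fclose(fid);
delete(fname);                               % clean up the temporary file
```

Afterwards, keys(counts) and values(counts) return the words and their frequencies in matching order. If the markers should be counted as tokens too, skip the regexprep step and match them explicitly in the pattern.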

Answer 2

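I assume the string '<s></s>' really does appear somewhere in the text file. If so, splitting on spaces is of course not enough; you have to return every '<s>', every '</s>', and every run of consecutive word characters: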

regexp(F, '<s>|\w*|</s>', 'match');

Full code:

% Read file contents
fid = fopen('test.txt','r');
F = fread(fid, '*char').';
fclose(fid);

% Split all words
C = regexp(F, '<s>|\w*|</s>', 'match');

% Find word frequencies
words  = unique(C);
counts = cellfun(@(x)sum(strcmp(x,C)), words);

% Group them together for display
freq = [num2cell(counts.') words.']
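
As a possible alternative to the cellfun/strcmp counting step, the third output of unique can be combined with accumarray so that the token list is scanned only once. This is a sketch under assumptions: the cell array C below is a hand-written stand-in for the regexp output, and sorting by descending count is an added convenience rather than part of the original answer.

```matlab
% Stand-in for the cell array returned by regexp(..., 'match').
C = {'<s>', 'an', 'immune', 'network', '</s>', '<s>', 'immune', '</s>'};

% idx(k) is the position of C{k} within the sorted list of unique words.
[words, ~, idx] = unique(C);

% Add 1 into the bin of each token: one count per unique word.
counts = accumarray(idx(:), 1);

% Pair counts with words and list the most frequent words first.
freq = [num2cell(counts) words(:)];
[~, order] = sort(counts, 'descend');
freq = freq(order, :);
```

For small files the difference is negligible; with many distinct words this avoids comparing every unique word against every token.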

Writing a regexp to read sentences from a text file in MATLAB
