Question

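I am trying to read some text files and store the term frequencies in a matrix. Each row should represent one document, and each column in that row should hold the weight of one term, so that in the finished matrix every column corresponds to a specific term. So far I have managed to do this, but my problem is that it only works when the documents have the same size, i.e. when every document contains the same number of words.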

The code is as follows:

F = dir('*.txt');
s = [];
allWords = [];
for ii = 1:length(F)
  filetext = fileread(F(ii).name);
  filetext = split(filetext);
  filetext = filetext';
  filetext = sort(filetext);
  allWords = [allWords, filetext];
  A = unique(allWords);
  l = length(A);
  s = [s; filetext];
end
%A = unique(allWords);
%Each row in D is one document
D = s;
out=zeros(size(D));
for k=1:numel(A)
  idx=ismember(D,A(k));
  out(:,k)=sum(idx,2);
end
disp(out)
%Matrix is as follows
%The terms are sorted as well
%0     0     1     1     1     1     1
%1     1     0     1     1     1     0
%Where I have the files example.txt and example2.txt with the contents
%we see the shining sun
%the sun is shining bright

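How can I still expand the matrix while preserving the "term structure" when the documents have different dimensions, which is of course the real-world scenario?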

I do not have access to the Text Analytics Toolbox, which has built-in functions for this.
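For illustration, here is one possible way to build such a matrix in base MATLAB. This is a minimal sketch under stated assumptions, not a definitive solution. In the code above, the line `s = [s; filetext]` stacks the word lists vertically, which only works when every document has the same number of words. The sketch keeps each document's words in its own cell instead and counts terms against a shared vocabulary. The variable names `docs`, `tokens`, `vocab` and `tf` are made up for this example, and the two hard-coded strings stand in for the text returned by `fileread`.

```matlab
% Sketch: document-term count matrix for documents of different lengths,
% using only base MATLAB (no Text Analytics Toolbox).
% Assumption: these two strings stand in for fileread(F(ii).name).
docs = {'we see the shining sun', ...
        'the sun is shining and the sun is bright'};

nDocs  = numel(docs);
tokens = cell(nDocs, 1);   % one cell per document, so lengths may differ
for ii = 1:nDocs
    % Lowercase, trim, and split on whitespace into a 1-by-n cell row of words.
    tokens{ii} = strsplit(strtrim(lower(docs{ii})));
end

% Vocabulary: sorted unique terms over all documents.
vocab = unique([tokens{:}]);

% Rows are documents, columns are terms (same order as vocab).
tf = zeros(nDocs, numel(vocab));
for ii = 1:nDocs
    % loc(j) is the column of vocab that holds the j-th word of document ii.
    [~, loc] = ismember(tokens{ii}, vocab);
    % Count how often each column index occurs in this document.
    tf(ii, :) = accumarray(loc(:), 1, [numel(vocab), 1]).';
end

disp(vocab)
disp(tf)
```

Because the vocabulary is computed once over all documents, column `k` of `tf` always refers to `vocab{k}`, regardless of how many words each document contains. The sketch does not handle punctuation or empty files; those would need extra cleaning before splitting.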

In MATLAB

0 answers: