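I am trying to read a number of text files and store the term frequencies in a matrix, where each row is one document and each column in that row holds the weight of one term. Once the matrix is complete, each column should correspond to a specific term. I have managed this so far, but my problem is that it only works when the documents have the same size, i.e. the same number of words in each document.
The code is as follows: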
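F = dir('*.txt');
s = [];
allWords = [];
for ii = 1:length(F)
    filetext = fileread(F(ii).name);
    filetext = split(filetext);        % split on whitespace into a column of words
    filetext = filetext';              % make it a row
    filetext = sort(filetext);
    allWords = [allWords, filetext];   % all words seen so far, across documents
    A = unique(allWords);              % sorted vocabulary
    l = length(A);
    s = [s; filetext];                 % fails when documents differ in word count
end
%A = unique(allWords);
%Each row in D is one document
D = s;
out = zeros(size(D,1), numel(A));      % one row per document, one column per term
for k = 1:numel(A)
    idx = ismember(D, A(k));
    out(:,k) = sum(idx, 2);            % occurrences of term k in each document
end
disp(out)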
%Matrix is as follows
%The terms are sorted as well
%0 0 1 1 1 1 1
%1 1 0 1 1 1 0
%Where I have the files example.txt and example2.txt with the contents
%we see the shining sun
%the sun is shining bright
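How can I extend the matrix while preserving the "term structure" when the documents have different dimensions, which is of course the real-world scenario?
I do not have access to the Text Analytics Toolbox, which has built-in functions for this.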
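
One possible direction, sketched under assumptions rather than offered as a definitive solution: keep each document's words in its own cell of a cell array, so the documents no longer need the same length, and count every document against one shared vocabulary. The variable names docs, tokens, vocab and tf are made up for illustration, and the two in-memory strings stand in for the file contents. Only base MATLAB functions (strsplit, unique, ismember, accumarray) are used.

```matlab
% Hypothetical in-memory documents standing in for the contents of the .txt files
docs = {'we see the shining sun', ...
        'the sun is shining bright and the sun is warm'};

% Tokenize each document into its own cell, so lengths may differ
tokens = cell(numel(docs), 1);
for ii = 1:numel(docs)
    tokens{ii} = strsplit(strtrim(docs{ii}));   % 1-by-n cell array of words
end

% Shared, sorted vocabulary over all documents
vocab = unique([tokens{:}]);

% One row per document, one column per vocabulary term
tf = zeros(numel(docs), numel(vocab));
for ii = 1:numel(docs)
    [~, loc] = ismember(tokens{ii}, vocab);     % vocabulary column of every word
    tf(ii, :) = accumarray(loc(:), 1, [numel(vocab), 1])';   % count per column
end

disp(vocab)   % and bright is see shining sun the warm we
disp(tf)      % row 1: 0 0 0 1 1 1 1 0 1
              % row 2: 1 1 2 0 1 2 2 1 0
```

To run this on files, docs{ii} could be replaced by fileread(F(ii).name) inside the first loop. Because the columns follow the sorted vocabulary, the "term structure" stays the same regardless of how many words each document contains.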