Question

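I would like help reading all of my text files at once and splitting the text into arrays for storage. I tried the code below but could not get it to work. The main problem is that, even though a for loop reads every text file, strsplit ends up splitting only one of them. How can I split each file into its own array, that is, one array per text file? Here is the code so far: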

for i = 1:10
    file = [num2str(i) '.eng'];
    % load string from a file
    STR = importdata(file);

    % extract string between tags
    B = regexprep(STR, '<.*?>','');

    % split each string by delimiters and add to C
    C = [];
    for j=1:length(B)
        if ~isempty(B{j})
            C = [C strsplit(B{j}, {'/', ' '})];
        end
    end
end

Here is a sample text file:

<DOC>

<DOCNO>annotations/01/1515.eng</DOCNO>

<TITLE>Yacare Ibera</TITLE>

<DESCRIPTION>an alligator in the water;</DESCRIPTION>

<NOTES></NOTES>

<LOCATION>Corrientes, Argentina</LOCATION>

<DATE>August 2002</DATE>

<IMAGE>images/01/1515.jpg</IMAGE>

<THUMBNAIL>thumbnails/01/1515.jpg</THUMBNAIL>

</DOC>

Answer 1

Suppose you are looking for the word 'alligator'. Then you can do the following:

clc

word = 'alligator';

num_of_files = 10;

C = cell(num_of_files, 1);

for i = 1:num_of_files

    file = [num2str(i) '.eng'];
    %// load string from a file
    STR = importdata(file);

    %// extract string between tags
    %// assuming you want to remove the angle brackets
    B = regexprep(STR, '<.*?>','');
    B(strcmp(B, '')) = [];

    %// split each string by delimiters and add to C    

    tmp = regexp(B, '/| ', 'split');
    C{i} = [tmp{:}];

end

where = [];

for j = 1:length(C)

    if any(strcmp(C{j}, word))

        where = [where num2str(j) '.eng, '];

    end

end

if isempty(where)

    disp(['No file contains the word ' word '.'])

else

    where(end-1:end) = [];
    disp(['The word ' word ' is contained in: ' where])

end
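
To see what the tag-stripping and splitting steps inside the loop do, here is a minimal sketch that runs on a few lines of the sample file held in memory, so no .eng files are needed. It assumes importdata returns one cell per line of the file. It also swaps in cellfun(@isempty, ...) for the strcmp test when dropping empty lines, and the variable name words is only for illustration.

```matlab
% A few lines of the sample file, one cell per line
% (assumed to match what importdata returns for such a file)
STR = {'<DOC>'; ...
       '<TITLE>Yacare Ibera</TITLE>'; ...
       '<DESCRIPTION>an alligator in the water;</DESCRIPTION>'; ...
       '<NOTES></NOTES>'; ...
       '<IMAGE>images/01/1515.jpg</IMAGE>'; ...
       '</DOC>'};

% Remove every <...> tag; lines that held only tags become empty
B = regexprep(STR, '<.*?>', '');

% Drop the empty lines (swapped in for the strcmp(B, '') test)
B(cellfun(@isempty, B)) = [];

% Split each remaining line on '/' or ' '; tmp holds one cell of tokens per line
tmp = regexp(B, '/| ', 'split');

% Flatten the per-line token lists into a single 1-by-N cell array
words = [tmp{:}];

disp(words)
```

Punctuation stays attached to the tokens: the list contains 'water;' rather than 'water', so an exact strcmp search for 'water' would not match.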

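Since I used 10 copies of your file, every one of them contains the word 'alligator', so I got

The word alligator is contained in: 1.eng, 2.eng, 3.eng, 4.eng, 5.eng, 6.eng, 7.eng, 8.eng, 9.eng, 10.eng

However, if I set word = 'cohomology', the output is

No file contains the word cohomology.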
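
The search step can also be written without an explicit loop. The sketch below is one possible variant, not part of the original answer: it uses a small hypothetical C (three made-up word lists standing in for the per-file cell arrays) and strcmpi instead of strcmp, on the assumption that a case-insensitive match is wanted. The names hits, idx and names are illustrative.

```matlab
% Hypothetical per-file word lists standing in for the cell array C
C = {{'Yacare', 'Ibera', 'an', 'alligator', 'in', 'the', 'water;'}; ...
     {'A', 'large', 'Alligator'}; ...
     {'a', 'river', 'bank'}};
word = 'alligator';

% One logical per file: true if any token matches, ignoring case
hits = cellfun(@(w) any(strcmpi(w, word)), C);

% Indices of matching files as a row vector, then their file names
idx = find(hits).';
names = arrayfun(@(k) sprintf('%d.eng', k), idx, 'UniformOutput', false);

if isempty(names)
    disp(['No file contains the word ' word '.'])
else
    disp(['The word ' word ' is contained in: ' strjoin(names, ', ')])
end
```

Using strjoin avoids trimming the trailing ', ' by hand; replace strcmpi with strcmp if the match should be case-sensitive, as in the original code.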

Read text from multiple text files at once and split it into arrays of words
