How can I use MATLAB to split a large text file into smaller text files at each blank line?

Time: 2016-10-02 14:21:58

Tags: matlab split text-files

I have a large text file that looks like this:

PMID- 123456123
OWN - NLM
DA  - 20160930

PMID- 27689094
OWN - NLM
VI  - 2016
DP  - 2016

PMID- 27688828
OWN - NLM
STAT- Publisher
DA  - 20160930
LR  - 20160930

and so on. I want to split the text file into smaller text files at each blank line, and name each smaller file after its PMID number, like this:

The file '123456123.txt' contains:

PMID- 123456123
OWN - NLM
DA  - 20160930

The file '27689094.txt' contains:

PMID- 27689094
OWN - NLM
VI  - 2016
DP  - 2016

The file '27688828.txt' contains:

PMID- 27688828
OWN - NLM
STAT- Publisher
DA  - 20160930
LR  - 20160930

Here is my attempt. I know how to identify a blank line (I think), but I don't know how to split the text and save it into smaller text files:

fid = fopen(filename);
text = fgets(fid);
blankline = sprintf('\r\n');

while ischar(text)
    if strcmp(blankline,str)
        %split the text
    else
        %write the text to the smaller file
    end
end

1 answer:

Answer 0 (score: 2)

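You can read the entire file in at once and then use regexp to split its contents at the empty lines. You can then use regexp again to extract the PMID of each group, and finally loop through all of the pieces and save each one. Processing the file as one large string like this is likely to be faster than reading it line by line with fgets.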

% Tell it what folder you want to put the files in
outdir = '/my/folder';

% Read the initial file in all at once
fid = fopen(filename, 'r');
data = fread(fid, '*char').';
fclose(fid);

% Break it into pieces based upon empty lines
pieces = regexp(data, '\n\s*\n', 'split');

% For each piece get the PMID
pmids = regexp(pieces, '(?<=PMID-\s*)\d*', 'match', 'once');

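% Now loop through and save each one
for k = 1:numel(pieces)
    % Skip pieces without a PMID (e.g. trailing whitespace at the end of the file)
    if isempty(pmids{k})
        continue
    end

    % Use the PMID of this piece to construct the output filename
    % (a separate variable, so the input filename is not overwritten)
    outfile = fullfile(outdir, [pmids{k}, '.txt']);

    % Now write the piece to the file
    fid = fopen(outfile, 'w');
    fwrite(fid, pieces{k});
    fclose(fid);
end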
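
To see what the two regexp calls return before writing any files, here is a minimal sketch that runs them on a short in-memory string. The sample string is a hypothetical stand-in built from the records in the question, and the strtrim call is an addition not present in the code above; it is only there so that a trailing newline does not end up in the last piece.

```matlab
% Hypothetical sample in the same layout as the question's file
data = sprintf(['PMID- 123456123\nOWN - NLM\nDA  - 20160930\n\n', ...
                'PMID- 27689094\nOWN - NLM\nVI  - 2016\nDP  - 2016\n']);

% Split on blank lines; strtrim removes leading/trailing whitespace first
pieces = regexp(strtrim(data), '\n\s*\n', 'split');

% Pull out the digits that follow "PMID-" in each piece
pmids = regexp(pieces, '(?<=PMID-\s*)\d*', 'match', 'once');

% Show how many records were found and which PMIDs they carry
fprintf('%d pieces\n', numel(pieces));
disp(pmids);
```

With this sample, pieces should hold two records and pmids should hold '123456123' and '27689094'. If the real file uses Windows line endings (\r\n), the same pattern should still find the blank lines because \s also matches \r, although each piece may then keep a trailing \r.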