Question

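I have a file that contains the full transcriptions of some sentences prepared for a speech recognition program. I have been trying to write some MATLAB code that goes through this file, extracts the values for each sentence and writes them to a new, individual file. So instead of having them all in one 'mlf' file, I want a separate file for each sentence.

For example, the 'mlf' file (which contains all the values for all the sentences) looks like this: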

#!MLF!#
"/N001.lab"
AH
SEE
I
GOT
THEM
MONTHS
AGO
.
"/N002.lab"
WELL
WORK
FOR
LIVE
WIRE
BUT
ERM
.
"/N003.lab"
IM
GOING
TO
SEE
JAMES
VINCENT
MCMORROW
.
etc

So each sentence is delimited by 'Nxxx.lab' and '.'. I need to create a new file for each Nxxx.lab, e.g. the file for N001 would contain only:

AH
SEE
I
GOT
THEM
MONTHS
AGO

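I have been trying to use fgetline to pick out the 'Nxxx.lab' and '.' boundaries, but it isn't working because I don't know how to write the contents to a new file separate from the 'mlf'.

If anyone can give me any guidance on a method to use, it would be much appreciated!

Cheers!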

Answer 1

Try this code (the input file test.mlf must be in the working directory):

%# read the file
filename = 'test.mlf';
fid = fopen(filename,'r');
lines = textscan(fid,'%s','Delimiter','\n','HeaderLines',1);
lines = lines{1};
fclose(fid);

%# find start and stop indices
istart = find(strncmp(lines, '"', 1));   %# lines starting with a double quote (safe for empty lines)
istop = find(strcmp(lines, '.'));
assert(numel(istart)==numel(istop) && all(istop>istart),'Check the input file format.')

%# write lines to new files
for k = 1:numel(istart)
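    %# strip the quotes, then drop the leading '/' so the file is created in the working directory
    [~,name,ext] = fileparts(lines{istart(k)}(2:end-1));
    filenew = [name ext];
    fout = fopen(filenew,'wt');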
    for l = (istart(k)+1):(istop(k)-1)
        fprintf(fout,'%s\n',lines{l});
    end
    fclose(fout);
end

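The code assumes that the file names are enclosed in double quotes, as in the example. If they are not, you can find the istart indices based on a pattern instead. Or simply assume that the entries for a new file start at line 2 of the input (index 1 once the header line is skipped) and after each dot: istart = [1; istop(1:end-1)+1];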
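Both alternatives for locating the label lines can be sketched as follows. This is only an illustration: the sample lines array and the '.lab' pattern are assumptions modelled on the file shown in the question, not something the original code defines.

```matlab
%# Hypothetical sample data standing in for the 'lines' cell array read by textscan
lines = {'/N001.lab'; 'AH'; 'SEE'; '.'; '/N002.lab'; 'WELL'; '.'};

%# Option 1: find label lines by pattern (optional quotes, name ending in .lab)
isLabel = ~cellfun(@isempty, regexp(lines, '^"?[^"]*\.lab"?$', 'once'));
istartPattern = find(isLabel);

%# Option 2: each entry starts on the line after the previous terminating dot
istop = find(strcmp(lines, '.'));
istartDots = [1; istop(1:end-1)+1];
```

The second option does not inspect the label lines at all, so it only works if every sentence really ends with a line containing a single dot.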

Answer 2

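You can use a growing cell array to collect the information.

Read one line at a time from the file.

If it is the first line read for a sentence, grab the file name and put it in the first column.

If the line read is a period, append it to the string and move the index on to the next row of the array. Write the new file using the contents.

The code below shows how to build the cell array while appending to a string stored in it. I assume that reading line by line is not a problem. You can also keep the carriage return/newline characters ('\n') in the string.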

%% Declare A
A = {}

%% Fill row 1
A(1,1) = {'file1'}
A(1,2) = {'Sentence 1'}
A(1,2) = { strcat(A{1,2}, ', has been appended')}

%% Fill row 2
A(2,1) = {'file2'}
A(2,2) = {'Sentence 2'}

Answer 3

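While I'm sure you could do this in MATLAB, I would recommend using Perl to split up the original file and then processing the individual files with MATLAB.

The following Perl script reads the entire file ("xxx.txt") and writes out the individual files according to the "NAME.lab" lines: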

open(my $fh, "<", "xxx.txt");

# read the entire file into $contents
# This may not be a good idea if the file is huge.
my $contents = do { local $/; <$fh> };

# iterate over the $contents string and extract the individual
# files
while($contents =~ /"(.*)"\n((.*\n)*?)\./mg) {

    # We arrive here with $1 holding the filename
    # and $2 the content up to the "." ending the section/sentence.
    open(my $fout, ">", $1);
    print $fout  $2;
    close($fout);
} 

close($fh);

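The multi-line regular expression is a bit tricky, but it does work. Perl is faster and more convenient for this kind of text manipulation. If you deal with a lot of text, it is a good tool to learn.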

Reading text and writing it to new files - Matlab
