Question

I have a text file in the following format:

 gene            complement(22995..24539)
                 /gene="ppp"
                 /locus_tag="MRA_0020"
 CDS             complement(22995..24539)
                 /gene="ppp"
                 /locus_tag="MRA_0020"
                 /codon_start=1
                 /transl_table=11
                 /product="putative serine/threonine phosphatase Ppp"
                 /protein_id="ABQ71738.1"
                 /db_xref="GI:148503929"
 gene            complement(24628..25095)
                 /locus_tag="MRA_0021"
 CDS             complement(24628..25095)
                 /locus_tag="MRA_0021"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="ABQ71739.1"
                 /db_xref="GI:148503930"
 gene            complement(25219..26802)
                 /locus_tag="MRA_0022"
 CDS             complement(25219..26802)
                 /locus_tag="MRA_0022"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="ABQ71740.1"
                 /db_xref="GI:148503931"

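I want to read the text file into MATLAB and build a list in which the information from each gene line marks the start of a new item. So for this example, the list would contain 3 items. I have tried a few things but could not get it to work. Does anyone have an idea of what I could do?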

Answer 1

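Here is a quick suggestion for an algorithm:

1. Open the file using fopen.
2. Start reading lines with fgetl until you find a line that starts with 'CDS'.
3. Keep reading lines until you get another line that starts with 'gene'.
4. For all lines between the lines found in (2) and (3):
   - Find the string between '/' and '='. This is the fieldname.
   - Find the string between the quotes. This is the value of the field.
5. Increment a counter by 1 and start again from #2, until you have finished reading the file.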

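These commands may be helpful:

- To find the string contained between specific characters, use for example regexp(lineThatHasBeenRead,'/(.+)=','tokens','once')
- To create the output structure, use dynamic field names, e.g. output(ct).(fieldname) = value;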
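
As a small illustration of those two commands, here is a sketch applied to a single line copied from the sample file. The variable names `sampleLine` and `item` are made up for this example, and the second pattern is just one way of pulling out the text after the '='.

```matlab
% One feature line copied from the sample file in the question
sampleLine = '                 /locus_tag="MRA_0020"';

% The token between '/' and '=' is the field name
fieldname = regexp(sampleLine, '/(.+)=', 'tokens', 'once');

% The token after '=' (with optional surrounding quotes dropped) is the value
value = regexp(sampleLine, '="?([^"]+)"?$', 'tokens', 'once');

% With 'tokens' and 'once', regexp returns a cell array of tokens,
% so index with {1} to get the character vectors themselves
item = struct();
item.(fieldname{1}) = value{1};   % dynamic field name

disp(item.locus_tag)
```

Note that `fieldname` and `value` are cell arrays here, which is why the assignment indexes them with `{1}`.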

Edit

Here is some code. I saved your example as 'test.txt'.

    % open file
    fid = fopen('test.txt');

    % parse the file
    eof = false;
    geneCt = 1;
    clear output % you cannot reassign output if it exists with different fieldnames already
    output(1:1000) = struct; % you may want to initialize fields here
    while ~eof
        % read lines till we find one with CDS
        % (also stop at end of file, otherwise this loop would never exit)
        foundCDS = false;
        while ~foundCDS && ~eof
            currentLine = fgetl(fid);
            % check for eof, then CDS. Allow whitespace at the beginning
            if ~ischar(currentLine) % fgetl returns -1 at end of file
                eof = true;
            elseif ~isempty(regexp(currentLine,'^\s+CDS','match','once'))
                foundCDS = true;
            end
        end % looking for CDS

        if ~eof
            % read (and remember) lines till we find 'gene'
            collectedLines = cell(1,20); % assume no more than 20 lines per gene. Row vector for looping below
            foundGene = false;
            lineCt = 1;
            while ~foundGene
                currentLine = fgetl(fid);
                % check for eof, then gene. Allow whitespace at the beginning
                if ~ischar(currentLine) % end of file - consider all data has been read
                    eof = true;
                    foundGene = true;
                elseif ~isempty(regexp(currentLine,'^\s+gene','match','once'))
                    foundGene = true;
                else
                    collectedLines{lineCt} = currentLine;
                    lineCt = lineCt + 1;
                end
            end

            % loop through collectedLines and assign. Do not loop through the
            % gene line
            for line = collectedLines(1:lineCt-1)
                fieldname = regexp(line{1},'/(.+)=','tokens','once');
                value = regexp(line{1},'="?([^"]+)"?$','tokens','once');
                % skip lines (e.g. blank ones) that are not /name=value pairs
                if isempty(fieldname) || isempty(value)
                    continue
                end
                % try converting value to number
                numValue = str2double(value);
                if isfinite(numValue)
                    value = numValue;
                else
                    value = value{1};
                end
                output(geneCt).(fieldname{1}) = value;
            end
            geneCt = geneCt + 1;
        end
    end % while eof

    % cleanup
    fclose(fid);
    output(geneCt:end) = [];
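
The code above only stores the /name=value lines that follow each CDS line; the coordinates on the gene line itself are not kept. If those are wanted as well, one possible approach is sketched below for a single gene line from the sample. The names `geneLine`, `geneInfo`, `isComplement`, `startPos` and `endPos` are hypothetical, and the sketch assumes the location always has the simple `start..end` form seen in the question, optionally wrapped in `complement(...)`.

```matlab
% A gene line copied from the sample file in the question
geneLine = ' gene            complement(22995..24539)';

% Capture the two coordinates on either side of '..'
coords = regexp(geneLine, '(\d+)\.\.(\d+)', 'tokens', 'once');

geneInfo = struct();
% Flag whether the location is wrapped in complement(...)
geneInfo.isComplement = ~isempty(regexp(geneLine, 'complement\(', 'once'));
% Convert the captured coordinate text to numbers
geneInfo.startPos = str2double(coords{1});
geneInfo.endPos   = str2double(coords{2});
```

The same dynamic-field assignment used for the other fields could then store these values in the entry for the current gene.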
