I have a text file in the following format:
gene complement(22995..24539)
/gene="ppp"
/locus_tag="MRA_0020"
CDS complement(22995..24539)
/gene="ppp"
/locus_tag="MRA_0020"
/codon_start=1
/transl_table=11
/product="putative serine/threonine phosphatase Ppp"
/protein_id="ABQ71738.1"
/db_xref="GI:148503929"
gene complement(24628..25095)
/locus_tag="MRA_0021"
CDS complement(24628..25095)
/locus_tag="MRA_0021"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="ABQ71739.1"
/db_xref="GI:148503930"
gene complement(25219..26802)
/locus_tag="MRA_0022"
CDS complement(25219..26802)
/locus_tag="MRA_0022"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="ABQ71740.1"
/db_xref="GI:148503931"
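I want to read the text file into Matlab and build a list in which each item starts with the information from a gene line. For this example, the list would have 3 items. I have tried a few things but could not get it to work. Does anyone have an idea of what I could do?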
Answer 0 (score: 2)
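Here is a quick suggestion for an algorithm:

1. Open the file with fopen.
2. Read lines with fgetl until you find a line that starts with 'CDS'.
3. Keep reading lines until you find another line that starts with 'gene'.
4. For each line in between, extract the string between '/' and '='. This is the fieldname; the text after '=' is the value.

These commands may be helpful: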
regexp(lineThatHasBeenRead,'/(.+)=','tokens','once')
output(ct).(fieldname) = value;
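As a minimal sketch of how those two commands fit together, the snippet below processes a single qualifier line. The line is hard-coded from the question rather than read with fgetl, and the names sampleLine and rec, as well as the value pattern, are assumptions made for illustration.

```matlab
% One qualifier line, hard-coded for illustration (normally it comes from fgetl)
sampleLine = '/locus_tag="MRA_0020"';

% The token between '/' and '=' is the field name.
% With 'tokens' and 'once', regexp returns a cell holding one string.
fieldname = regexp(sampleLine, '/(.+)=', 'tokens', 'once');

% The token after '=' is the value, without the surrounding quotes
value = regexp(sampleLine, '="?([^"]+)"?$', 'tokens', 'once');

% Dynamic field name: this creates rec.locus_tag
rec = struct();
rec.(fieldname{1}) = value{1};
disp(rec.locus_tag)
```

Because regexp returns cells here, the field name and value have to be unwrapped with {1} before they are used in the dynamic assignment.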
Edit
Here is some code. I saved your example as 'test.txt'.
% open file
fid = fopen('test.txt');
if fid == -1
    error('Could not open test.txt');
end

% parse the file
eof = false;
geneCt = 1;
clear output % you cannot reassign output if it exists with different fieldnames already
output(1:1000) = struct; % preallocate; you may want to initialize fields here
while ~eof
    % read lines till we find one with CDS
    foundCDS = false;
    while ~foundCDS && ~eof % stop at end of file, otherwise this loop never ends
        currentLine = fgetl(fid);
        % check for eof, then CDS. Allow whitespace at the beginning
        if ~ischar(currentLine)
            % end of file (fgetl returns -1)
            eof = true;
        elseif ~isempty(regexp(currentLine,'^\s*CDS','match','once'))
            foundCDS = true;
        end
    end % looking for CDS

    if ~eof
        % read (and remember) lines till we find 'gene'
        collectedLines = cell(1,20); % assume no more than 20 lines per gene. Row vector for looping below
        foundGene = false;
        lineCt = 1;
        while ~foundGene
            currentLine = fgetl(fid);
            % check for eof, then gene. Allow whitespace at the beginning
            if ~ischar(currentLine)
                % end of file - consider all data has been read
                eof = true;
                foundGene = true;
            elseif ~isempty(regexp(currentLine,'^\s*gene','match','once'))
                foundGene = true;
            else
                collectedLines{lineCt} = currentLine;
                lineCt = lineCt + 1;
            end
        end

        % loop through collectedLines and assign. Do not loop through the
        % gene line
        for lineCell = collectedLines(1:lineCt-1)
            fieldname = regexp(lineCell{1},'/(.+)=','tokens','once');
            value = regexp(lineCell{1},'="?([^"]+)"?$','tokens','once');
            % skip lines that do not look like /name=value
            if isempty(fieldname) || isempty(value)
                continue
            end
            % try converting value to number
            numValue = str2double(value);
            if isfinite(numValue)
                value = numValue;
            else
                value = value{1};
            end
            output(geneCt).(fieldname{1}) = value;
        end
        geneCt = geneCt + 1;
    end
end % while ~eof

% cleanup
fclose(fid);
output(geneCt:end) = [];
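
Once parsing finishes, output is a struct array with one element per CDS block. As a rough sketch of how such a result could be inspected, the snippet below builds a two-element stand-in by hand and collects one field across all elements. The variable demo is hypothetical and its values are copied from the question. Fields that a record never set, such as gene for MRA_0021, come back empty.

```matlab
% Hand-built stand-in for the parser's result (values copied from the question)
demo = struct('gene', {'ppp', []}, ...
              'locus_tag', {'MRA_0020', 'MRA_0021'}, ...
              'codon_start', {1, 1});

% Collect one field from every element into a cell array
locusTags = {demo.locus_tag};

% Records that never had a /gene line have an empty gene field
hasGeneName = ~cellfun(@isempty, {demo.gene});

fprintf('%d records, %d with a gene name\n', numel(demo), nnz(hasGeneName));
```

The same pattern applies to the real output variable, for example {output.product} to list every product description.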