How do I read a text file into MATLAB and turn it into a list?

Time: 2010-06-13 23:27:12

Tags: matlab file-io import text-files

I have a text file in the following format:

 gene            complement(22995..24539)
                 /gene="ppp"
                 /locus_tag="MRA_0020"
 CDS             complement(22995..24539)
                 /gene="ppp"
                 /locus_tag="MRA_0020"
                 /codon_start=1
                 /transl_table=11
                 /product="putative serine/threonine phosphatase Ppp"
                 /protein_id="ABQ71738.1"
                 /db_xref="GI:148503929"
 gene            complement(24628..25095)
                 /locus_tag="MRA_0021"
 CDS             complement(24628..25095)
                 /locus_tag="MRA_0021"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="ABQ71739.1"
                 /db_xref="GI:148503930"
 gene            complement(25219..26802)
                 /locus_tag="MRA_0022"
 CDS             complement(25219..26802)
                 /locus_tag="MRA_0022"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="ABQ71740.1"
                 /db_xref="GI:148503931"

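I want to read the text file into MATLAB and build a list in which each item starts at a gene line and holds the information that follows it. For this example, the list would therefore have 3 items. I have tried a few things but could not get it to work. Does anyone have an idea of what I could do?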

1 answer:

Answer 0 (score: 2):

Here is a quick suggestion for an algorithm:

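  1. Open the file using fopen.
  2. Read lines with fgetl until you find a line that starts with 'CDS'.
  3. Keep reading lines until you reach another line that starts with 'gene'.
  4. For all the lines between the ones found in (2) and (3):
    • Find the string between '/' and '='. This is the fieldname.
    • Find the string between the quotes. This is the value of the field.
  5. Increase the counter by 1 and start over from #2 until you have finished reading the file.

These commands may be helpful:

    • To find specific characters in a string, use e.g. regexp(lineThatHasBeenRead,'/(.+)=','tokens','once')
    • To create the output structure, use dynamic field names, e.g. output(ct).(fieldname) = value;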
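
As a small illustration of those two commands, here is a sketch that applies them to a single qualifier line copied from the example file. The variable names `sampleLine` and `rec` are made up for this sketch, and the field-name pattern uses `\w+` instead of `.+` on the assumption that field names contain only letters, digits and underscores.

```matlab
% One qualifier line copied from the example file
sampleLine = '                 /locus_tag="MRA_0020"';

% The token between '/' and '=' is the field name
fieldname = regexp(sampleLine, '/(\w+)=', 'tokens', 'once');

% The token after '=' (with optional surrounding quotes) is the value
value = regexp(sampleLine, '="?([^"]+)"?$', 'tokens', 'once');

% Dynamic field name: the field is named by the string in fieldname{1}
rec = struct;
rec.(fieldname{1}) = value{1};
```

With 'tokens' and 'once', regexp returns a cell array holding the captured text, which is why both results are indexed with `{1}`.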

    Edit

    Here is some code. I saved your example as 'test.txt'.

    % open file
    fid = fopen('test.txt');
    
    % parse the file
    eof = false;
    geneCt = 1;
    clear output % you cannot reassign output if it exists with different fieldnames already
    output(1:1000) = struct; % you may want to initialize fields here
    while ~eof
        % read lines till we find one with CDS
        foundCDS = false;
        while ~foundCDS && ~eof
            currentLine = fgetl(fid);
            % check for eof, then CDS. Allow whitespace at the beginning
            if ~ischar(currentLine)
                % end of file (fgetl returns -1): stop searching
                eof = true;
            elseif ~isempty(regexp(currentLine,'^\s+CDS','match','once'))
                foundCDS = true;
            end
        end % looking for CDS
    
        if ~eof
    
            % read (and remember) lines till we find 'gene'
            collectedLines = cell(1,20); % assume no more than 20 lines per gene. Row vector for looping below
            foundGene = false;
            lineCt = 1;
            while ~foundGene
                currentLine = fgetl(fid);
                % check for eof, then gene. Allow whitespace at the beginning
                if ~ischar(currentLine)
                    % end of file - consider all data has been read
                    eof = true;
                    foundGene = true;
                elseif ~isempty(regexp(currentLine,'^\s+gene','match','once'))
                    foundGene = true;
                else
                    collectedLines{lineCt} = currentLine;
                    lineCt = lineCt + 1;
                end
            end
    
            % loop through collectedLines and assign. Do not loop through the
            % gene line
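            for line = collectedLines(1:lineCt-1)
                % \w+ keeps the field name from running into an '=' inside the value
                fieldname = regexp(line{1},'/(\w+)=','tokens','once');
                value = regexp(line{1},'="?([^"]+)"?$','tokens','once');
                if isempty(fieldname) || isempty(value)
                    continue % not a /key=value line (e.g. a blank line), skip it
                end
                % try converting value to number
                numValue = str2double(value{1});
                if isfinite(numValue)
                    value = numValue;
                else
                    value = value{1};
                end
                output(geneCt).(fieldname{1}) = value;
            end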
            geneCt = geneCt + 1;
        end
    end % while eof
    
    % cleanup
    fclose(fid);
    output(geneCt:end) = [];
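
The result is a struct array with one element per gene, where a field that a record did not set is left empty. As a rough sketch of how such an array can be queried, the snippet below builds a small hand-made stand-in called `demo` (a hypothetical name, filled with values from the example file) rather than relying on the parsed `output`.

```matlab
% Hand-built stand-in for the first two parsed records (values from the example file)
demo = struct('locus_tag',   {'MRA_0020', 'MRA_0021'}, ...
              'gene',        {'ppp', []}, ...
              'codon_start', {1, 1});

% Collect one text field across all records into a cell array
tags = {demo.locus_tag};

% Logical mask of the records that carry a gene name
hasName = ~cellfun(@isempty, {demo.gene});

% Numeric fields can be concatenated directly into a vector
codonStarts = [demo.codon_start];
```

The same expressions should work on `output` once the parsing code has run, assuming the field in question exists on at least one record.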