I have a very large text file in the following format:
gene1 gene2
gene3
gene4
gene5
gene6
gene7 gene8
gene9
...
I would like the file to be in this format instead:
gene1 gene2
gene1 gene3
gene1 gene4
gene1 gene5
gene1 gene6
gene7 gene8
gene7 gene9
...
gene1, gene2, etc. are combinations of letters without spaces, and they can have different lengths. Here is a sample file:
https://drive.google.com/open?id=0B6u8fZadKIp2aEVIUTJ6NzlJVlk
Can someone point me in the right direction?
Answer 0: (score: 2)
% getting the text and the first word
text_in_file = fileread('oldfile.txt');
first_word = regexp(text_in_file, '\S*', 'match','once');
% generating the new string: each line break is followed by the first word
str = regexprep(text_in_file,'[\n\r]+',['\n' first_word ' ']);
% writing to the file ('%s' keeps any % or \ in the text from being read as a format)
fid = fopen('newfile.txt', 'wt');
fprintf(fid, '%s', str);
fclose(fid);
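Here is a modified version that handles the case where many rows contain two genes. At each such row it resets and starts prepending the new first gene name to the single-gene rows that follow. Is that what you want?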
% getting the text
text_in_file = fileread('oldfile.txt');
% splitting into rows
rows = regexp(text_in_file,'\r?\n','split'); % also drops a trailing \r from Windows line endings
% number of genes in the rows
A = cellfun(@(x) numel(regexp(x, '\S+', 'match')), rows); % count whitespace-delimited tokens, not tabs
% row indices with two genes
two_word_rows = find(A==2);
% first genes
first_words = cellfun(@(x) regexp(x, '\S+', 'match', 'once'), rows(two_word_rows), 'UniformOutput' , false);
% modifying the rows
for i=setdiff(1:numel(rows), two_word_rows) % exclude the two gene rows
    last_idx = find(two_word_rows<i,1,'last'); % which word to add?
    if isempty(last_idx) || isempty(strtrim(rows{i})), continue; end % nothing to prepend
    rows{i} = sprintf('%s\t%s', char(first_words(last_idx)), rows{i});
end
% writing to the file
fid = fopen('newfile.txt', 'wt');
fprintf(fid, '%s\n', rows{:}); % the split removed the line breaks, so add them back
fclose(fid);
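Please don't just copy and paste the code. Try to go through it, read the comments, and look at the documentation of the functions used.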
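To see what the row-based approach does, the same steps can be run on a few rows held in memory. This is only a sketch: the sample rows are hypothetical, modelled on the question, and the tab separator is an assumption about the real file.

```matlab
% Hypothetical in-memory sample modelled on the question (tab-separated)
rows = {sprintf('gene1\tgene2'), 'gene3', 'gene4', sprintf('gene7\tgene8'), 'gene9'};

% Count whitespace-delimited tokens in each row
n_tokens = cellfun(@(x) numel(regexp(x, '\S+', 'match')), rows);
two_word_rows = find(n_tokens == 2);

% First gene of every two-gene row
first_words = cellfun(@(x) regexp(x, '\S+', 'match', 'once'), ...
    rows(two_word_rows), 'UniformOutput', false);

% Prefix each single-gene row with the first gene of the closest
% preceding two-gene row
for i = setdiff(1:numel(rows), two_word_rows)
    last_idx = find(two_word_rows < i, 1, 'last');
    rows{i} = sprintf('%s\t%s', first_words{last_idx}, rows{i});
end

% Print the result, one row per line
fprintf('%s\n', rows{:});
```

Rows 2 and 3 get gene1 as prefix, while row 5 gets gene7, because find(..., 1, 'last') picks the most recent two-gene row before the current one.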
Answer 1: (score: 1)
This code imports all 32491 gene names and then writes them to a new file.
oldfile='file.txt';
newfile='file2.txt';
fclose all;
fid=fopen(oldfile,'r');
genes={};
l=fgetl(fid);
while ~isnumeric(l) % fgetl returns -1 at end of file
    l = regexp(l, '\W', 'split');
    l = l(~cellfun(@isempty,l));
    if ~isempty(l)
        genes(end+1:end+numel(l))=l;
    end
    l=fgetl(fid);
end
fclose(fid);
fid=fopen(newfile,'wt');
for ct = 2:numel(genes)
    fprintf(fid,'%s %s\n',genes{1},genes{ct});
end
fclose(fid);
Output:
TGM1 HIST1H4C
TGM1 HIST1H4B
TGM1 HIST1H4A
TGM1 TGM3
TGM1 HIST1H4G
TGM1 HIST1H4F
TGM1 HIST1H4E
TGM1 HIST1H4D
TGM1 HIST1H4K
TGM1 HIST1H4J
(etc.)
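Note that the code above pairs every gene with the very first one in the file. If the grouping from the question is wanted instead (gene7 becoming the new prefix after the line "gene7 gene8"), a line-by-line variant could look like the sketch below. It is not the answerer's method: the temporary file names and the sample content are made up so that the snippet runs on its own, and splitting on whitespace is an assumption about the real file.

```matlab
% Write a small hypothetical sample file so the sketch runs on its own
oldfile = [tempname '.txt'];
newfile = [tempname '.txt'];
fid = fopen(oldfile, 'wt');
fprintf(fid, 'gene1 gene2\ngene3\ngene4\ngene7 gene8\ngene9\n');
fclose(fid);

fin  = fopen(oldfile, 'r');
fout = fopen(newfile, 'wt');
prefix = '';                % first gene of the most recent two-gene line
l = fgetl(fin);
while ischar(l)             % fgetl returns -1 (numeric) at end of file
    tokens = regexp(l, '\S+', 'match');   % whitespace-delimited gene names
    if numel(tokens) >= 2
        prefix = tokens{1};               % a two-gene line starts a new group
        fprintf(fout, '%s %s\n', tokens{1}, tokens{2});
    elseif numel(tokens) == 1 && ~isempty(prefix)
        fprintf(fout, '%s %s\n', prefix, tokens{1});
    end
    l = fgetl(fin);
end
fclose(fin);
fclose(fout);

% Show the result
disp(fileread(newfile));
```

Reading one line at a time keeps memory use low for a very large file, since only the current line and the current prefix are held at any moment.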