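I want to merge 100 files based on the mir_seq column they share. The output should be a single file containing mir_seq plus the freq column from each of the original files.
The files look like this:
File 1: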
mir_seq seq name freq mir start end mism add t5 t3 s5 s3 DB ambiguity
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT TGAGAAGAAGCACTGTAGCTCTT seq_100006_x0 0 hsa-miR-143-3p 61 81 6AT u-TT 0 0 AGTCTGAG GCTCAGGA miRNA 1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTA GACCCTGTAGATCCGAATTTGTA seq_100012_x1 1 hsa-miR-10a-5p 22 43 1GT u-A 0 u-G TATATACC TGTGTAAG miRNA 1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTG GACCCTGTAGATCCGAATTTGTG seq_100013_x54 54 hsa-miR-10a-5p 22 44 1GT 0 0 0 TATATACC TGTGTAAG miRNA 1
File 2:
mir_seq seq name freq mir start end mism add t5 t3 s5 s3 DB ambiguity
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT TGAGAAGAAGCACTGTAGCTCTT seq_100006_x1 1 hsa-miR-143-3p 61 81 6AT u-TT 0 0 AGTCTGAG GCTCAGGA miRNA 1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTA GACCCTGTAGATCCGAATTTGTA seq_100012_x0 0 hsa-miR-10a-5p 22 43 1GT u-A 0 u-G TATATACC TGTGTAAG miRNA 1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTG GACCCTGTAGATCCGAATTTGTG seq_100013_x24 24 hsa-miR-10a-5p 22 44 1GT 0 0 0 TATATACC TGTGTAAG miRNA 1
hsa-miR-1296-5p_TTAGGGCCCTGGCTCCATCT TTAGGGCCCTGGCTCCATCT seq_100019_x17 17 hsa-miR-1296-5p 16 35 0 0 0 u-CC TGGGTTAG CTCCTTTA miRNA 1
The files are tab-delimited and are named as follows; only the part between _ and .txt.mirna differs:
Miraligner_94G.txt.mirna
Miraligner_944G.txt.mirna
The output file should look like this:
mir_seq freq_94G freq_944G freq_912R
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT 0 12 55
Answer 0 (score: 2)
You only provided one sample input file, so this is obviously untested, since you can't test a "merge" with just 1 file:
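awk '
BEGIN { OFS = "\t" }                  # tab-separated output
FNR==1 {
    # First line of each file: derive the sample ID from the file name,
    # e.g. Miraligner_94G.txt.mirna -> 94G
    split(FILENAME,tmp,/[_.]/)
    sfx = tmp[2]
    sfxs[sfx]
    next                              # skip the header row
}
{
    keys[$1]                          # remember every mir_seq seen
    val[$1,sfx] = $4                  # freq for this mir_seq in this file
}
END {
    printf "mir_seq"
    for (sfx in sfxs) {
        printf "%sfreq_%s", OFS, sfx
    }
    print ""
    for (key in keys) {
        printf "%s", key
        for (sfx in sfxs) {
            # %d prints 0 when a mir_seq is absent from a file
            printf "%s%d", OFS, val[key,sfx]
        }
        print ""
    }
}
' Miraligner_*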
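One caveat with the script above: awk does not define the order in which for (x in array) visits elements, so columns and rows can come out in any order. The following is only a sketch of one way around that, assuming you want columns in command-line order and rows in first-seen order. The function name merge_freq_ordered is made up for illustration, and stripping the directory from FILENAME is an added assumption so that paths containing "_" or "." do not confuse the split.

```shell
# Hypothetical helper: merge the freq column of several miraligner files,
# keeping columns in command-line order and rows in first-seen order.
merge_freq_ordered() {
    awk '
    BEGIN { OFS = "\t" }
    FNR == 1 {
        # New file: drop any directory part, then take the text between
        # "_" and ".txt" as the sample ID.
        name = FILENAME
        sub(/.*\//, "", name)
        split(name, tmp, /[_.]/)
        sfx = tmp[2]
        sfxs[++nsfx] = sfx
        next                          # skip the header row
    }
    {
        if (!($1 in seen)) {          # record each key once, in input order
            seen[$1]
            keys[++nkeys] = $1
        }
        val[$1, sfx] = $4
    }
    END {
        printf "mir_seq"
        for (i = 1; i <= nsfx; i++)
            printf "%sfreq_%s", OFS, sfxs[i]
        print ""
        for (k = 1; k <= nkeys; k++) {
            printf "%s", keys[k]
            for (i = 1; i <= nsfx; i++)
                printf "%s%d", OFS, val[keys[k], sfxs[i]]
            print ""
        }
    }
    ' "$@"
}
```

It would be called as merge_freq_ordered Miraligner_*.txt.mirna > merged.tsv, where merged.tsv is just a placeholder output name. Indexed arrays with explicit counters are used instead of gawk-only sorting functions so the sketch stays portable across awk implementations.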
Answer 1 (score: 2)
OK, given that you are working with these files:
Miraligner_94G.txt.mirna
Miraligner_944G.txt.mirna
It looks like you are just picking one column out of each file.
So:
#!/usr/bin/env perl
use strict;
use warnings;

my %data;    # $data{mir_seq}{freq_<id>} = freq value
my %seen;    # every freq_<id> column encountered

foreach my $file ( glob("Miraligner_*") ) {
    # The sample ID is the part between "_" and ".txt" in the file name.
    my ($freq_id) = ( $file =~ m/_(\w+)\.txt/ );
    $freq_id = "freq_$freq_id";
    $seen{$freq_id}++;

    open( my $input, "<", $file ) or die "Cannot open $file: $!";
    # The first line holds the column names.
    my @headers = split( ' ', <$input> );
    while (<$input>) {
        my %line;
        @line{@headers} = split;    # map column names to this row's fields
        my $key = $line{'mir_seq'};
        $data{$key}{$freq_id} = $line{'freq'};
    }
    close($input);
}

my @cols = sort keys %seen;
print join( "\t", "mir_seq", @cols ), "\n";
foreach my $mir_seq ( sort keys %data ) {
    # A mir_seq missing from a file gets 0 in that column.
    my @output_cols = map { $_ // 0 } @{ $data{$mir_seq} }{@cols};
    print join( "\t", $mir_seq, @output_cols ), "\n";
}
For the given data set this outputs (tab-delimited):
mir_seq freq_944G freq_94G
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTA 1 0
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTG 54 24
hsa-miR-1296-5p_TTAGGGCCCTGGCTCCATCT 0 17
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT 0 1
Note: if a value is undefined, a zero is currently printed. If you want to print something else, you need to modify that map.
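In the Perl script that means changing the 0 in $_ // 0. As a rough sketch of the same choice in awk (the language of the other answer), the function below takes the placeholder as its first argument. The name merge_freq_fill and the fill variable are invented for this example, and it assumes columns should follow command-line order.

```shell
# Hypothetical helper: like the awk merge, but missing values are printed
# as a caller-supplied placeholder instead of 0.
# Usage: merge_freq_fill PLACEHOLDER file...
merge_freq_fill() {
    fill=$1
    shift
    awk -v fill="$fill" '
    BEGIN { OFS = "\t" }
    FNR == 1 {
        name = FILENAME
        sub(/.*\//, "", name)         # ignore any directory part
        split(name, tmp, /[_.]/)
        sfx = tmp[2]
        sfxs[++n] = sfx
        next                          # skip the header row
    }
    {
        if (!($1 in seen)) {
            seen[$1]
            keys[++m] = $1
        }
        val[$1, sfx] = $4
    }
    END {
        printf "mir_seq"
        for (i = 1; i <= n; i++)
            printf "%sfreq_%s", OFS, sfxs[i]
        print ""
        for (k = 1; k <= m; k++) {
            printf "%s", keys[k]
            for (i = 1; i <= n; i++) {
                # Test membership with "in" so that a missing pair is not
                # confused with a real count of 0.
                cell = ((keys[k], sfxs[i]) in val) ? val[keys[k], sfxs[i]] : fill
                printf "%s%s", OFS, cell
            }
            print ""
        }
    }
    ' "$@"
}
```

For example, merge_freq_fill NA Miraligner_*.txt.mirna would print NA wherever a mir_seq does not occur in a file, which keeps a genuine freq of 0 distinguishable from an absent row.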
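It also sorts both the columns and the rows alphabetically, which may not be what you want either, but there are plenty of sorting examples you can refer to.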
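If a different row order is wanted, one option is to reorder the merged table after the fact rather than inside the script. The sketch below assumes tab-delimited input with a single header line on stdin; the function name sort_merged_by_column is hypothetical.

```shell
# Hypothetical helper: sort a merged table read from stdin, keeping the
# header line first. $1 is the 1-based column to sort on, numerically,
# largest value first.
sort_merged_by_column() {
    col=$1
    {
        # Pass the header through untouched, then sort the remaining rows.
        IFS= read -r header
        printf '%s\n' "$header"
        sort -t "$(printf '\t')" -k "${col},${col}nr"
    }
}
```

Assuming the Perl script were saved as merge.pl (a placeholder name), perl merge.pl | sort_merged_by_column 2 would list the rows by the first freq column in descending order. Rows with equal values fall back to sort's default whole-line comparison.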