I have two tab-delimited files, shown below:
File A
chr1 123 aa b c d
chr1 234 a b c d
chr1 345 aa b c d
chr1 456 a b c d
....
File B
xxxx abcd chr1 123 aa c d e
yyyy defg chr1 345 aa e f g
...
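I want to join these two files on three columns (the ones holding "chr1", "123" and "aa" in the first row) and append the first two columns of file B to the matching lines of file A, so that the output looks like this:
Output: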
chr1 123 aa b c d xxxx abcd
chr1 234 a b c d
chr1 345 aa b c d yyyy defg
chr1 456 a b c d
Can anyone help me do this in awk? If possible, with an awk one-liner?
Answer 0: (score: 11)
Here is one way using awk:
$ awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' OFS='\t' fileb filea
chr1 123 a b c xxxx abcd
chr1 234 a b c
chr1 345 a b c yyyy defg
chr1 456 a b c
Explanation:
NR==FNR # True only while reading the first file (fileb): the overall record number equals the per-file record number
a[$3,$4]=$1OFS$2 # Store fields 1 and 2 (joined by OFS) in the array, keyed on fields 3 and 4
next # Skip to the next record (don't run the second block)
$6=a[$1,$2] # Assign the looked-up value to field 6 (this rebuilds the record with OFS; a sixth field already present in filea is overwritten)
print # Print the current line, which now carries the matching entry from fileb in $6
OFS='\t' # Separate output fields with a single TAB
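To see why the NR==FNR test selects only the first file, it can help to print both counters while awk reads two files. The snippet below is a minimal sketch; the file names `first` and `second` and the `first-file`/`later-file` labels are made up for illustration. Keep in mind that if the first file is empty, NR==FNR is also true for the second file.

```shell
# Hypothetical scratch files, only for demonstrating NR vs FNR.
tmp=$(mktemp -d)
printf 'x\ny\n' > "$tmp/first"
printf 'p\nq\nr\n' > "$tmp/second"

# NR counts records across all input files; FNR restarts at 1 for each file.
nr_fnr=$(awk '{ print $0, NR, FNR, (NR == FNR ? "first-file" : "later-file") }' \
    "$tmp/first" "$tmp/second")
printf '%s\n' "$nr_fnr"
```

The two counters agree only on the lines of `first`, which is what lets the first block build the lookup array before the second file is processed.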
Edit:
For the three-field version of the problem, you can simply add the extra field to the key:
$ awk 'NR==FNR{a[$3,$4,$5]=$1OFS$2;next}{$6=a[$1,$2,$3];print}' OFS='\t' fileb filea
chr1 123 aa b c xxxx abcd
chr1 234 a b c
chr1 345 aa b c yyyy defg
chr1 456 a b c
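Note that the commands above assign to $6, which fits the five-column sample output shown here. With six-column input like the File A in the question, that assignment would replace the last column. A variant that appends the looked-up value instead might look like the sketch below. It assumes both files are tab-delimited, recreates the question's sample rows in a temporary directory, and sets FS as well as OFS (the original command leaves FS at its default).

```shell
# Recreate the sample rows from the question as tab-delimited files (temp dir is illustrative).
tmp=$(mktemp -d)
printf 'chr1\t123\taa\tb\tc\td\nchr1\t234\ta\tb\tc\td\nchr1\t345\taa\tb\tc\td\nchr1\t456\ta\tb\tc\td\n' > "$tmp/filea"
printf 'xxxx\tabcd\tchr1\t123\taa\tc\td\te\nyyyy\tdefg\tchr1\t345\taa\te\tf\tg\n' > "$tmp/fileb"

appended=$(awk 'BEGIN { FS = OFS = "\t" }
    # First file (fileb): remember columns 1-2 under the three-column key.
    NR == FNR { a[$3,$4,$5] = $1 OFS $2; next }
    # Second file (filea): append the stored columns when the key is known.
    ($1,$2,$3) in a { print $0, a[$1,$2,$3]; next }
    # No match: print the line unchanged, with no trailing separator.
    { print }' "$tmp/fileb" "$tmp/filea")
printf '%s\n' "$appended"
```

Testing the key with `in` avoids creating empty array entries for keys that were never stored, and unmatched lines keep their original form.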
Answer 1: (score: 2)
You can use join, but the pipeline gets so complicated that it may be easier to switch to a more powerful language such as Perl.
join -11 -21 -o1.1,1.2,1.3,1.4,1.5,2.4,2.5 \
<(sed 's/ \+/:/' fileA | sort) \
<(sed 's/ \+/:/' fileB | sort) \
| join -11 -22 -a1 -o1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.5,2.6 \
- <(sed 's/ \+\([^ ]\+\) \+\([^ ]\+\)/ \1:\2/' fileC | sort -k2) \
| sed 's/:/ /'
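For just the two tab-delimited files from the question, the same idea (fold the key columns into a single field so that join can match on it) could be sketched roughly as follows. This is not the pipeline above, only an illustration under some assumptions: the temporary file names are made up, `:` is assumed not to occur in the key columns, and the result comes out sorted by key rather than in File A's original order.

```shell
# Recreate the sample rows from the question (temp dir and file names are illustrative).
tmp=$(mktemp -d)
tab=$(printf '\t')
printf 'chr1\t123\taa\tb\tc\td\nchr1\t234\ta\tb\tc\td\nchr1\t345\taa\tb\tc\td\nchr1\t456\ta\tb\tc\td\n' > "$tmp/filea"
printf 'xxxx\tabcd\tchr1\t123\taa\tc\td\te\nyyyy\tdefg\tchr1\t345\taa\te\tf\tg\n' > "$tmp/fileb"

# Prefix each File A line with a combined key; join needs its input sorted on that key.
awk -F'\t' -v OFS='\t' '{ print $1 ":" $2 ":" $3, $0 }' "$tmp/filea" |
    LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/a.keyed"
# From File B keep only the combined key plus the two columns to append.
awk -F'\t' -v OFS='\t' '{ print $3 ":" $4 ":" $5, $1, $2 }' "$tmp/fileb" |
    LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/b.keyed"

# -a 1 also keeps File A lines without a partner; cut then drops the helper key column.
joined=$(LC_ALL=C join -t "$tab" -a 1 "$tmp/a.keyed" "$tmp/b.keyed" | cut -f2-)
printf '%s\n' "$joined"
```

Using the same `LC_ALL=C` for sort and join keeps their notion of ordering consistent.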
A Perl solution, using a hash to remember all the information:
#!/usr/bin/perl
use warnings;
use strict;
#                  key_start key_end keep_from output
my %files = (A => [0, 1, 2, [0 .. 3]],
             B => [0, 1, 2, [-2, -1]],
             C => [1, 2, 3, [-2, -1]],
            );
my %hash;
for my $file (keys %files) {
    open my $FH, '<', "file$file" or die "file$file: $!";
    while (<$FH>) {
        my @fields = split;
        $hash{"@fields[$files{$file}[0], $files{$file}[1]]"}{$file}
            = [ @fields[$files{$file}[2] .. $#fields] ];
    }
}
for my $key (sort keys %hash) {
    print $key, join(' ', q(),
                     grep defined, map {
                         @{ $hash{$key}{$_} }[@{ $files{$_}[-1] }]
                     } sort keys %files), "\n";
}