I have this file:
427 A C A/C 12
436 G C G/C 12
445 C T C/T 12
447 A G A/G 9
451 T C T/C 5
456 A G A/G 12
493 G A G/A 12
I want to read the first column and, for each ID, find all other IDs that differ from it by less than 10.
427 A C A/C 12 436
436 G C G/C 12 427,445
445 C T C/T 12 436,447,451
447 A G A/G 9 445,451,456
451 T C T/C 5 445,447,456
456 A G A/G 12 451,447
493 G A G/A 12
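The last column should look like the above: for each ID, it lists all other IDs that lie within + or - 10 bases of it. For example, for 436 the bounds are {426 - 446}; the other IDs in that range are 427 and 445, so I show them in column 6.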
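To make the rule concrete, here is a minimal Python sketch (not part of the original question) that applies the window to the sample IDs. The function name `neighbors` and its `window` parameter are made up for illustration, and the bounds are assumed to be inclusive, as the {426 - 446} example suggests.

```python
def neighbors(ids, target, window=10):
    """Return the other IDs within +/- window of target (bounds assumed inclusive)."""
    return [i for i in ids if i != target and abs(i - target) <= window]


# IDs taken from the first column of the sample file.
ids = [427, 436, 445, 447, 451, 456, 493]

print(neighbors(ids, 436))  # [427, 445]
print(neighbors(ids, 493))  # []
```

None of the sample IDs are exactly 10 apart, so an exclusive bound would give the same result on this data.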
Answer 0 (score: 3)
Here is one way using Perl:
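use strict;
use warnings;

# Read the whole file into memory, one line per element.
open my $fh, '<', 'dataFile.txt' or die $!;
chomp( my @data = <$fh> );
close $fh;

# The ID is the first run of digits on each line.
my @IDs = map /(\d+)/, @data;

for (@data) {
    my ($id) = /(\d+)/;

    # Append every other ID within 10 (inclusive) of this line's ID.
    print "$_\t"
        . ( join ',', grep { abs $id - $_ < 11 and $id != $_ } @IDs )
        . "\n";
}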
Output:
427 A C A/C 12 436
436 G C G/C 12 427,445
445 C T C/T 12 436,447,451
447 A G A/G 9 445,451,456
451 T C T/C 5 445,447,456
456 A G A/G 12 447,451
493 G A G/A 12
Answer 1 (score: 2)
Here is one way using GNU awk. Run it like this:
awk -f script.awk file.txt{,} | column -t
Contents of script.awk:
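# First pass (FNR==NR): collect every ID from column 1.
FNR==NR {
    array[$1]++
    next
}
# Second pass: append the IDs within 10 of this line's ID.
{
    # "@ind_num_asc" (gawk 4.0+) sorts the IDs numerically rather than as strings.
    n = asorti(array, sort, "@ind_num_asc")
    for (i=1; i<=n; i++) {
        id = sort[i] + 0    # asorti returns strings; force a numeric comparison
        if (id <= $1 + 10 && id >= $1 - 10 && id != $1) {
            line = (line ? line "," : line) id
        }
    }
    print $0, line
    line = ""
}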
Results:
427 A C A/C 12 436
436 G C G/C 12 427,445
445 C T C/T 12 436,447,451
447 A G A/G 9 445,451,456
451 T C T/C 5 445,447,456
456 A G A/G 12 447,451
493 G A G/A 12
Alternatively, here is the one-liner:
awk 'FNR==NR { array[$1]++; next } { n = asorti(array, sort, "@ind_num_asc"); for (i=1; i<=n; i++) if ((id = sort[i] + 0) <= $1 + 10 && id >= $1 - 10 && id != $1) line = (line ? line "," : line) id; print $0, line; line = "" }' file.txt{,} | column -t
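Both answers above rescan the full ID list for every line, which is fine for a small file. For a large file, one alternative is to sort the IDs once and use binary search to find each window. The following is only a sketch in Python, not a port of either answer: the function name `annotate` and the `window` parameter are hypothetical, and it assumes every line is non-empty and begins with an integer ID.

```python
from bisect import bisect_left, bisect_right


def annotate(lines, window=10):
    """Append to each line the other IDs within +/- window of its first column."""
    # Sort the distinct IDs once so each window can be located by binary search.
    ids = sorted({int(line.split()[0]) for line in lines})
    out = []
    for line in lines:
        target = int(line.split()[0])
        lo = bisect_left(ids, target - window)    # index of first ID >= target - window
        hi = bisect_right(ids, target + window)   # index just past last ID <= target + window
        near = [str(i) for i in ids[lo:hi] if i != target]
        # Lines with no neighbours are returned unchanged.
        out.append(line + "\t" + ",".join(near) if near else line)
    return out


sample = [
    "427 A C A/C 12",
    "436 G C G/C 12",
    "445 C T C/T 12",
    "447 A G A/G 9",
    "451 T C T/C 5",
    "456 A G A/G 12",
    "493 G A G/A 12",
]

for row in annotate(sample):
    print(row)
```

Because the IDs are deduplicated, a line whose ID occurs more than once in the file would not list its own duplicate as a neighbour.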