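I have a tab-delimited file with three columns (GO ID, biological process, gene). I want to compare rows on the third column and, where they match, join their first columns together and their second columns together. I'm new to programming and have tried several inefficient approaches, but none gave me the result I want.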
Here is an example of the input and the desired output.
Input:
GO:0007155 cell adhesion ACHE
GO:0022610 biological adhesion ACHE
GO:0007155 cell adhesion ADAM19
GO:0022610 biological adhesion ADAM19
GO:0007155 cell adhesion AMBN
GO:0022610 biological adhesion AMBN
Output:
GO:0007155;GO:0022610 cell adhesion;biological adhesion ACHE
GO:0007155;GO:0022610 cell adhesion;biological adhesion ADAM19
GO:0007155;GO:0022610 cell adhesion;biological adhesion AMBN
Answer 0 (score: 1)
I saved your data as a tab-delimited file.
$: cat cols
GO:0007155 cell adhesion ACHE
GO:0022610 biological adhesion ACHE
GO:0007155 cell adhesion ADAM19
GO:0022610 biological adhesion ADAM19
GO:0007155 cell adhesion AMBN
GO:0022610 biological adhesion AMBN
$: declare -A A B C # associative arrays - "lookup tables"
$: tab=$'\t' # just to make it easier to see it embedded
$: while IFS=$'\t' read -r a b c # -r keeps any backslashes in the data literal
do A[$c]="${A[$c]};$a"
B[$c]="${B[$c]};$b"
done < cols # stack cols
$: for c in "${!A[@]}"
do echo "${A[$c]#;}$tab${B[$c]#;}$tab$c" # strip leading semicolons
done
GO:0007155;GO:0022610 cell adhesion;biological adhesion ADAM19
GO:0007155;GO:0022610 cell adhesion;biological adhesion AMBN
GO:0007155;GO:0022610 cell adhesion;biological adhesion ACHE
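Does the output order matter? Bash does not guarantee any particular iteration order for the keys of an associative array, which is why the genes above come out shuffled. If you need them in alphabetical order, for example, you can use this: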
$: for c in $( printf "%s\n" "${!A[@]}" | sort )
do echo "${A[$c]#;}$tab${B[$c]#;}$tab$c"
done
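If you would rather keep the genes in the order they first appear in the file, one option is to record that order explicitly. Below is a minimal sketch that swaps the bash arrays for awk, so it does not need bash 4; the function name `merge_by_gene` is my own, and it assumes every data line has exactly three tab-separated fields.

```shell
# merge_by_gene FILE
# Joins columns 1 and 2 (with ';') for rows that share column 3,
# printing genes in the order they are first seen in FILE.
merge_by_gene() {
  awk -F'\t' -v OFS='\t' '
    NF < 3 { next }                      # skip blank or malformed lines
    !($3 in ids) {                       # first time this gene is seen
      order[++n] = $3                    # remember its position
      ids[$3] = $1
      procs[$3] = $2
      next
    }
    {                                    # gene seen before: append
      ids[$3]   = ids[$3] ";" $1
      procs[$3] = procs[$3] ";" $2
    }
    END {
      for (i = 1; i <= n; i++) {
        g = order[i]
        print ids[g], procs[g], g
      }
    }
  ' "$1"
}
```

Calling `merge_by_gene cols` on the sample data should list ACHE, ADAM19 and AMBN in that order, because that is the order in which they first appear in the file.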
Answer 1 (score: 1)
One way, using the ever-handy GNU datamash plus a bit of massaging to get the output into the desired format:
$ datamash -g 3 collapse 1 collapse 2 < input.tsv | \
awk 'BEGIN { FS=OFS="\t" } { print $2, $3, $1 }' | tr , ';'
GO:0007155;GO:0022610 cell adhesion;biological adhesion ACHE
GO:0007155;GO:0022610 cell adhesion;biological adhesion ADAM19
GO:0007155;GO:0022610 cell adhesion;biological adhesion AMBN
(This assumes the file is already sorted on the third column, as it is in the sample data.)
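If you are not sure whether your real file meets that assumption, you can check it first. The following is only a sketch: the function name `is_sorted_by_col3` is made up for illustration, and the `-s` flag (supported by GNU and BSD sort, but not required by POSIX) is there so that only the third column is compared.

```shell
# is_sorted_by_col3 FILE
# Succeeds (exit status 0) if FILE is already ordered on its third
# tab-separated column; fails otherwise. Nothing is printed.
is_sorted_by_col3() {
  # -c checks order without sorting; -s disables the whole-line tie-break
  LC_ALL=C sort -s -t "$(printf '\t')" -k3,3 -c "$1" 2>/dev/null
}

# Possible usage (input.tsv is the file name from the example above):
#   is_sorted_by_col3 input.tsv || echo "sort input.tsv on column 3 first" >&2
```

If the check fails, either sort the file on column 3 beforehand or pass datamash its `-s`/`--sort` option so that it sorts the input before grouping.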
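In perl (this version does not need sorted input, since it groups everything in a hash and sorts the keys at the end):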
$ perl -F"\t" -lane 'push @{$genes{$F[2]}}, [@F[0,1]];
END { $,="\t";
for (sort keys %genes) {
print join(";", map { $_->[0] } @{$genes{$_}}),
join(";", map { $_->[1] } @{$genes{$_}}),
$_ } }' input.tsv
GO:0007155;GO:0022610 cell adhesion;biological adhesion ACHE
GO:0007155;GO:0022610 cell adhesion;biological adhesion ADAM19
GO:0007155;GO:0022610 cell adhesion;biological adhesion AMBN