I want to parse InterProScan results for use with the topGO R package.
I need the file in a format slightly different from the one I have.
# input file (gene_ID GO_ID1, GO_ID2, GO_ID3, ....)
Q97R95 GO:0004349, GO:0005737, GO:0006561
Q97R95 GO:0004349, GO:0006561
Q97R95 GO:0005737, GO:0006561
Q97R95 GO:0006561
# desired output (duplicates removed and rows collapsed)
Q97R95 GO:0004349,GO:0005737,GO:0006561
You can test your tool against the whole data file here:
https://drive.google.com/file/d/0B8-ZAuZe8jldMHRsbGgtZmVlZVU/view?usp=sharing
Answer 0: (score: 1)
You can use GNU awk's two-dimensional arrays (arrays of arrays, available in gawk 4.0 and later):
awk -F'[, ]+' '{for(i=2;i<=NF;i++)r[$1][$i]}
END{for(x in r){
printf "%s ",x;b=0;
for(y in r[x]){printf "%s%s",(b?",":""),y;b=1}
print ""}
}' file
It gives:
Q97R95 GO:0005737,GO:0006561,GO:0004349
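Duplicate fields are removed, but the original order is not preserved, because awk's for (y in array) loop visits elements in an unspecified order.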
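If the order matters, one possible variant is to record each ID and each GO term the first time it is seen instead of relying on for-in traversal. The following is only a sketch, not part of the original answer: the function name collapse_go is hypothetical, it assumes lines have no leading whitespace, and it skips blank lines and lines starting with #. It uses only POSIX awk features, so it should not need gawk.

```shell
# collapse_go (hypothetical name): merge lines sharing an ID, drop duplicate
# GO terms, and keep first-seen order for both IDs and terms.
collapse_go() {
  awk -F'[, \t]+' '
    # Skip blank lines and comment lines.
    /^[ \t]*(#|$)/ { next }
    {
      # Remember each ID in the order it first appears.
      if (!($1 in terms)) { ids[++n] = $1; terms[$1] = "" }
      for (i = 2; i <= NF; i++) {
        # Ignore empty fields (trailing separators) and terms already seen.
        if ($i == "" || (($1, $i) in seen)) continue
        seen[$1, $i] = 1
        terms[$1] = terms[$1] (terms[$1] == "" ? "" : ",") $i
      }
    }
    END { for (k = 1; k <= n; k++) print ids[k] " " terms[ids[k]] }
  ' "$@"
}
```

Called as collapse_go file (or with data piped on standard input), this keeps one string of terms per ID in memory, so memory use grows with the number of distinct ID and term pairs.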
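Answer 1: (score: 0)
Here is a hopefully tidy Perl solution. It preserves the order of keys and values as far as possible, and it does not hold the whole file in memory, only as much as it needs to do the job. It relies on lines with the same key being adjacent in the input.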
#!/usr/bin/env perl
use strict;
use warnings;

my ($prev_key, @seen_values, %seen_values);

while (<>) {
    # Parse the input
    chomp;
    next unless /\S/;    # skip blank lines
    my ($key, $values) = split /\s+/, $_, 2;
    my @values = defined $values ? split(/,\s*/, $values) : ();

    # If we have a new key (or this is the first line)...
    if (!defined $prev_key || $key ne $prev_key) {
        # output the old data, as long as there is some,
        if (@seen_values) {
            print "$prev_key\t", join(", ", @seen_values), "\n";
        }
        # clear it out,
        @seen_values = %seen_values = ();
        # and remember the new key for next time.
        $prev_key = $key;
    }

    # Merge this line's values with previous ones, de-duplicating
    # but preserving order.
    for my $value (@values) {
        push @seen_values, $value unless $seen_values{$value}++;
    }
}

# Output what's left after the last line
if (@seen_values) {
    print "$prev_key\t", join(", ", @seen_values), "\n";
}
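The Perl script above flushes its output whenever the key changes, so it only merges lines that are adjacent. If the input is not already grouped by ID, one option is to group it first with a stable sort on the first field. This is a sketch under assumptions: group_by_id is a hypothetical helper name, and the -s (stable) flag is provided by GNU and BSD sort but is not required by POSIX.

```shell
# group_by_id (hypothetical name): bring lines with the same ID together.
# -k1,1 sorts on the first whitespace-separated field only.
# -s keeps lines that share an ID in their original relative order.
# LC_ALL=C makes the comparison bytewise and locale-independent.
group_by_id() {
  LC_ALL=C sort -s -k1,1 "$@"
}
```

The grouped lines can then be piped into the script. Note that IDs come out in sorted order rather than in order of first appearance, and sort has to read the whole input before it prints anything.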