I have two files. One is a loosely structured database like this:
GO:0000001 mitochondrion inheritance P
GO:0000002 mitochondrial genome maintenance P
GO:0000003 GO:0019952 GO:0050876 reproduction P
GO:0000005 ribosomal chaperone activity F obs
GO:0000006 high affinity zinc uptake transmembrane transporter activity F
GO:0000007 low-affinity zinc ion transmembrane transporter activity F
GO:0000008 GO:0000013 thioredoxin F obs
GO:0000009 alpha-1,6-mannosyltransferase activity F
GO:0000010 trans-hexaprenyltranstransferase activity F
GO:0000011 vacuole inheritance P
GO:0000012 single strand break repair P
GO:0000014 single-stranded DNA specific endodeoxyribonuclease activity F
GO:0000015 phosphopyruvate hydratase complex C
GO:0000016 lactase activity F
GO:0000017 alpha-glucoside transport P
GO:0000018 regulation of DNA recombination P
GO:0000019 regulation of mitotic recombination P
The other is the file I need to "classify" using that database. It looks like this:
gene_id_100000 GO:0004370 69.52 187 57 0 7 193 4 190 1e-90 280
gene_id_100000 GO:0005524 69.52 187 57 0 7 193 4 190 1e-90 280
gene_id_100000 GO:0006071 69.52 187 57 0 7 193 4 190 1e-90 280
gene_id_100000 GO:0006072 69.52 187 57 0 7 193 4 190 1e-90 280
gene_id_100000 GO:0019563 69.52 187 57 0 7 193 4 190 1e-90 280
gene_id_100002 GO:0000105 99.42 173 1 0 1 173 256 428 8e-122 357
gene_id_100002 GO:0004399 99.42 173 1 0 1 173 256 428 8e-122 357
gene_id_100002 GO:0008270 99.42 173 1 0 1 173 256 428 8e-122 357
gene_id_100002 GO:0051287 99.42 173 1 0 1 173 256 428 8e-122 357
gene_id_100008 GO:0005737 84.35 147 23 0 7 153 5 151 1e-90 267
gene_id_100008 GO:0008616 84.35 147 23 0 7 153 5 151 1e-90 267
gene_id_100008 GO:0033739 84.35 147 23 0 7 153 5 151 1e-90 267
gene_id_100008 GO:0046857 84.35 147 23 0 7 153 5 151 1e-90 267
gene_id_100017 GO:0003938 71.75 177 50 0 1 177 75 251 6e-86 268
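As you can see, the common key between the two files is the GO: term. The only things I care about are the first two columns of the file I need to classify (the gene_id column and the GO: column) and the description of each GO: term from the database.
The output should look like this (the first two columns of the query file, followed by the database description that matches the GO term):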
gene_id_100000 GO:0004370 glycerol kinase activity F
gene_id_100000 GO:0005524 ATP binding F
gene_id_100000 GO:0006071 glycerol metabolic process P
gene_id_100000 GO:0006072 glycerol-3-phosphate metabolic process P
gene_id_100000 GO:0019563 glycerol catabolic process P
gene_id_100002 GO:0000105 histidine biosynthetic process P
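Some lines in the database have more than one GO: term, so I really can't get it to work... Also, I don't really know how to handle two files at once in awk.
Thanks in advance for your help! I hope I explained it clearly.
Edit for jaypal: some of the missing lines are the ones I used as the file example: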
gene_id_100000 GO:0004370 69.52 187 57 0 7 193 4 190 1e-90 280
gene_id_100000 GO:0005524 69.52 187 57 0 7 193 4 190 1e-90 280
gene_id_100000 GO:0006071 69.52 187 57 0 7 193 4 190 1e-90 280
gene_id_100000 GO:0006072 69.52 187 57 0 7 193 4 190 1e-90 280
gene_id_100000 GO:0019563 69.52 187 57 0 7 193 4 190 1e-90 280
gene_id_100002 GO:0000105 99.42 173 1 0 1 173 256 428 8e-122 357
Their corresponding database lines are:
GO:0004370 glycerol kinase activity F
GO:0005524 ATP binding F
GO:0006071 glycerol metabolic process P
GO:0006072 glycerol-3-phosphate metabolic process P
GO:0019563 glycerol catabolic process P
GO:0000105 histidine biosynthetic process P
Answer 0 (score: 2)
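awk '
    # First file (database): NR==FNR holds only while reading it.
    NR==FNR {
        line = $0;
        # Strip every GO ID, leaving the description and its trailing flags.
        gsub(/GO:[0-9]+[ \t]*/, "", line);
        # Index the description under each leading GO ID (primary and alternates).
        for(i=1; i<=NF && substr($i, 1, 3) == "GO:"; ++i)
            desc[$i] = line;
        next;
    }
    # Second file (query): print gene_id, GO term, and the looked-up description.
    { print $1, $2, desc[$2]; }
' database file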
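The first block runs only for the first file; the second block runs only for the second file. The first file is the database of codes and descriptions: the description is stored in a hash under every GO number found on that line, so alternate IDs resolve to the same description. For the second file, the description is looked up by the GO term in column 2 and printed after the first two columns.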
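To see how this copes with database lines that carry several GO IDs, here is a small self-contained sketch of the same idea. The `annotate` function name, the temporary files, the gene IDs `gene_a` to `gene_c`, the term `GO:9999999`, and the `NOT_FOUND` placeholder are assumptions made for illustration; they are not part of the answer above.

```shell
#!/bin/sh
# Hypothetical helper: annotate DATABASE QUERY
annotate() {
    awk '
        # Database file: index the description under every leading GO ID.
        NR == FNR {
            line = $0
            gsub(/GO:[0-9]+[ \t]*/, "", line)
            for (i = 1; i <= NF && substr($i, 1, 3) == "GO:"; ++i)
                desc[$i] = line
            next
        }
        # Query file: print a placeholder when the GO term is not in the database.
        { print $1, $2, (($2 in desc) ? desc[$2] : "NOT_FOUND") }
    ' "$1" "$2"
}

# Tiny sample inputs (two database lines taken from the question, made-up query lines).
tmp=$(mktemp -d)
printf '%s\n' \
    'GO:0000002 mitochondrial genome maintenance P' \
    'GO:0000003 GO:0019952 GO:0050876 reproduction P' > "$tmp/database"
printf '%s\n' \
    'gene_a GO:0019952 69.52 187' \
    'gene_b GO:0000002 99.42 173' \
    'gene_c GO:9999999 84.35 147' > "$tmp/file"

annotate "$tmp/database" "$tmp/file"
rm -rf "$tmp"
```

With these inputs, `gene_a` is matched through the alternate ID `GO:0019952` and gets the description `reproduction P`, while `gene_c` gets the placeholder because its term is absent from the sample database. Without the `in` test, an unknown term would simply print an empty description.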
Answer 1 (score: 1)
Using awk:
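awk '
    # First file (query): collect every gene_id seen for each GO term.
    NR==FNR {
        genes[$2] = genes[$2] " " $1;
        next
    }
    # Second file (database).
    {
        line = $0;
        # Strip the GO IDs, leaving the description and its trailing flags.
        gsub (/GO:[[:digit:]]+[[:space:]]*/, "", line);
        for (i=1; i<=NF; i++) {
            if ($i in genes) {
                # Print one line per gene that carries this GO term.
                n = split(genes[$i], g, " ");
                for (j=1; j<=n; j++)
                    print g[j], $i, line;
            }
        }
    }
' file database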
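The first block reads the query file and stores the gene_id indexed by its GO term. The second block then reads the database: it strips the GO: IDs from each line to leave the description, and prints the gene_id, the GO term, and the description for a GO ID on that line that also occurs in the query file. Note that the output follows the order of the database, not of the query file.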