I have these two .dat files (only the first 20 or so lines of each are shown):
GO:0005509 PDCD6
GO:0004672 CDK1
GO:0005524 CDK1
GO:0005634 CDK1
GO:0005737 CDK1
GO:0006468 CDK1
GO:0005615 SERPINB6
GO:0006629 APOC2
GO:0006869 APOC2
GO:0008047 APOC2
GO:0042627 APOC2
GO:0043085 APOC2
GO:0001932 TADA2L
GO:0003677 TADA2L
GO:0005671 TADA2L
GO:0006357 TADA2L
GO:0007067 TADA2L
GO:0008270 TADA2L
GO:0016573 TADA2L
and
GO:0000001 mitochondrion inheritance
GO:0000002 mitochondrial genome maintenance
GO:0000003 reproduction
GO:0000005 ribosomal chaperone activity
GO:0000006 high affinity zinc uptake transmembrane transporter activity
GO:0000007 low-affinity zinc ion transmembrane transporter activity
GO:0000008 thioredoxin
GO:0000009 alpha-1,6-mannosyltransferase activity
GO:0000010 trans-hexaprenyltranstransferase activity
GO:0000011 vacuole inheritance
GO:0000012 single strand break repair
GO:0000014 single-stranded DNA specific endodeoxyribonuclease activity
GO:0000015 phosphopyruvate hydratase complex
GO:0000016 lactase activity
GO:0000017 alpha-glucoside transport
GO:0000018 regulation of DNA recombination
GO:0000019 regulation of mitotic recombination
GO:0000020 negative regulation of recombination within rDNA repeats
(...)
When I try to join the two files, I only get a few results (exactly 10). The complete code is:
ls *gene_association* | while read file;
do
echo;
echo @@@ File: $file;
echo;
# New file "assoc_specie.txt"
IFS='_' read -r -a array <<< "$file"
SPECIE=${array[2]}
#Filtering comments (!comment...)
cat $file | grep -v '!' > assoc_$SPECIE.txt;
gawk 'BEGIN{OFS="\t";FS="\t"}{print $5, $3}' assoc_$SPECIE.txt > goTerms_$SPECIE.dat;
join goTerms_$SPECIE.dat gene_ontology.dat > join.dat
echo
done;
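I don't know what I'm doing wrong, but join is clearly not returning all of the matches.
Thanks in advance.
PS: The assoc_specie.txt file has this format (only the first line is shown):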
UniProtKB A0A024QZ42 PDCD6 GO:0005509 GO_REF:0000002 IEA InterPro:IPR002048 F HCG1985580, isoform CRA_c A0A024QZ42_HUMAN|PDCD6|hCG_1985580 protein taxon:9606 20160312 InterPro
(...)
Answer 0 (score: -2)
Thanks, everyone. I just sorted the first file and ran uniq on it, and now it works perfectly! Thanks!!!
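For anyone hitting the same problem: join expects both inputs to be sorted on the join field, and with unsorted input it can silently skip matching lines, which would explain getting only a handful of results. The following is a minimal sketch of the sort-then-join step, not the exact commands used above. The file names goTerms_sample.dat, goTerms_sorted.dat and gene_ontology_sorted.dat and the tiny sample data are made up for illustration, and it assumes both files are tab-separated.

```shell
set -eu

# Work in a scratch directory so no real file is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data: unsorted, with one duplicated line.
printf 'GO:0005509\tPDCD6\nGO:0004672\tCDK1\nGO:0005524\tCDK1\nGO:0004672\tCDK1\n' > goTerms_sample.dat
printf 'GO:0005524\tATP binding\nGO:0004672\tprotein kinase activity\nGO:0005509\tcalcium ion binding\n' > gene_ontology.dat

# Use the same collation for sort and join so they agree on the ordering.
LC_ALL=C
export LC_ALL

# A literal tab, used as the field delimiter.
tab=$(printf '\t')

# Sort both inputs; -u also drops exact duplicate lines (the "uniq" step).
sort -u goTerms_sample.dat > goTerms_sorted.dat
sort -t "$tab" -k1,1 gene_ontology.dat > gene_ontology_sorted.dat

# Join on the first field (the GO ID). Setting -t to a tab keeps
# multi-word descriptions together as a single output field.
join -t "$tab" goTerms_sorted.dat gene_ontology_sorted.dat > join.dat

cat join.dat
```

In the loop from the question, the same idea would mean sorting goTerms_$SPECIE.dat before the join, and sorting gene_ontology.dat once before the loop if it is not already in that order. Running `sort -c` on a file is a quick way to check whether it is already sorted.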