Automatically extracting KEGG annotation results and organizing them in Excel

Date: 2018-06-15 15:58:29

Tags: r bioinformatics genome

I submitted a query with amino acid sequences to KAAS (the KEGG Automatic Annotation Server).

I then downloaded the result file, named "myfile.keg". A small example file showing what it looks like can be downloaded here: https://www.dropbox.com/s/ixf0091z5q3cx9z/myfile.keg?dl=0

+D  KO
#<h2><a href="/kegg/kegg2.html"><img src="/Fig/bget/kegg3.gif" align="middle" border=0></a> &nbsp; KEGG Orthology (KO)</h2> 75prot_protdiff_GD_5h
!
A<b>Metabolism</b>
B
B  <b>Carbohydrate metabolism</b>
C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D      MYGENEACCESSION01; K01623  ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]
C    00020 Citrate cycle (TCA cycle) [PATH:ko00020]
C    00030 Pentose phosphate pathway [PATH:ko00030]
D      MYGENEACCESSION02; K01623  ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]
C    00040 Pentose and glucuronate interconversions [PATH:ko00040]
C    00051 Fructose and mannose metabolism [PATH:ko00051]
D      MYGENEACCESSION03; K17497  PMM; phosphomannomutase [EC:5.4.2.8]
D      MYGENEACCESSION04; K01623  ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]
C    00052 Galactose metabolism [PATH:ko00052]
C    00053 Ascorbate and aldarate metabolism [PATH:ko00053]
C    00500 Starch and sucrose metabolism [PATH:ko00500]
C    00520 Amino sugar and nucleotide sugar metabolism [PATH:ko00520]
D      MYGENEACCESSION05; K01183  E3.2.1.14; chitinase [EC:3.2.1.14]
C    00620 Pyruvate metabolism [PATH:ko00620]
C    00630 Glyoxylate and dicarboxylate metabolism [PATH:ko00630]
C    00640 Propanoate metabolism [PATH:ko00640]
C    00650 Butanoate metabolism [PATH:ko00650]
C    00660 C5-Branched dibasic acid metabolism [PATH:ko00660]
C    00562 Inositol phosphate metabolism [PATH:ko00562]
B

!
#<hr>
#<b>[ <a href="/kegg/ko.html">KO</a> | <a href="/kegg/brite.html">BRITE</a> | <a href="/kegg/kegg2.html">KEGG2</a> | <a href="/kegg/">KEGG</a> ]</b><br>
#Last updated: May 18, 2018
#<br><br><a href="/kegg-bin/get_htext?ko00001_all.keg">&raquo; All categories</a>

(I opened it with Notepad++.)

In this file you can see the KEGG functional categories assigned to each of my genes, which are named "MYGENEACCESSION01" (or "-02", "-03", and so on).
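In the sample, the first character of each line gives its level in the hierarchy: A, B and C lines are nested categories, and D lines are the gene entries. As a minimal sketch in Python (not R), assuming every gene line follows the `accession; KO  description` layout seen above, a single D line can be split with a regular expression. The name `parse_gene_line` is hypothetical, chosen only for illustration.

```python
import re

# Assumed layout of a gene line, taken from the sample file:
# "D      <accession>; <KO id>  <description>"
GENE_LINE = re.compile(
    r"^D\s+(?P<accession>[^;]+);\s+(?P<ko>K\d+)\s+(?P<description>.+)$"
)

def parse_gene_line(line):
    """Return accession, ko and description as a dict, or None for non-gene lines."""
    match = GENE_LINE.match(line.rstrip())
    return match.groupdict() if match else None

example = "D      MYGENEACCESSION03; K17497  PMM; phosphomannomutase [EC:5.4.2.8]"
print(parse_gene_line(example))
```

Only the first semicolon separates the accession from the rest, so descriptions that contain their own semicolon (such as "PMM; phosphomannomutase") stay intact.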

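I would like to extract all the information from this first .keg file and organize it into a new file (for example, an Excel spreadsheet) like this one: https://www.dropbox.com/s/xq4714ngesap9dx/annotation.xlsx?dl=0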

CSV version:

accession,kegg.first.level,kegg.second.level,kegg.third.level,kegg.fourth.level,path,KO
MYGENEACCESSION01,metabolism,carbohydrate metabolism,Glycolysis / Gluconeogenesis,"ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]",PATH:ko00010,K01623
MYGENEACCESSION02,metabolism,carbohydrate metabolism,Pentose phosphate pathway,"ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]",PATH:ko00030,K01623
MYGENEACCESSION03,metabolism,carbohydrate metabolism,Fructose and mannose metabolism,PMM; phosphomannomutase [EC:5.4.2.8],PATH:ko00051,K17497
MYGENEACCESSION04,metabolism,carbohydrate metabolism,Fructose and mannose metabolism,"ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]",PATH:ko00051,K01623
MYGENEACCESSION05,metabolism,carbohydrate metabolism,Amino sugar and nucleotide sugar metabolism,chitinase [EC:3.2.1.14],PATH:ko00520,K01183

I did this by hand, but it is very tedious, and my real dataset is much larger than the example provided.

Any ideas on how to automate this with R or another program? (Do you think an R script could do the job?)
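One possible approach, sketched here in Python rather than R, is to read the file line by line, remember the most recent A, B and C category, and emit one row for every D line. This is a minimal sketch under the assumption that the real file has the same layout as the sample; `parse_keg`, `write_csv` and the embedded `sample` text are hypothetical names introduced for illustration. Category names keep the capitalization used in the .keg file, and the fourth-level column holds everything after the KO identifier.

```python
import csv
import io
import re

TAG = re.compile(r"<[^>]+>")  # removes the <b>...</b> markup on A and B lines
PATHWAY = re.compile(r"^\d+\s+(?P<name>.+?)\s+\[(?P<path>PATH:[^\]]+)\]$")
GENE = re.compile(r"^(?P<accession>[^;]+);\s+(?P<ko>K\d+)\s+(?P<description>.+)$")

FIELDS = ["accession", "kegg.first.level", "kegg.second.level",
          "kegg.third.level", "kegg.fourth.level", "path", "KO"]

def parse_keg(lines):
    """Yield one dict per D (gene) line, carrying its enclosing A/B/C categories."""
    first = second = third = path = ""
    for raw in lines:
        line = raw.rstrip("\r\n")
        level, body = line[:1], line[1:].strip()
        if not body:
            continue  # blank lines, "!" markers and bare "B" separators
        if level == "A":
            first, second, third, path = TAG.sub("", body).strip(), "", "", ""
        elif level == "B":
            second, third, path = TAG.sub("", body).strip(), "", ""
        elif level == "C":
            m = PATHWAY.match(body)
            third, path = (m["name"], m["path"]) if m else (body, "")
        elif level == "D":
            m = GENE.match(body)
            if m:  # lines that do not fit the assumed layout are skipped
                yield {
                    "accession": m["accession"],
                    "kegg.first.level": first,
                    "kegg.second.level": second,
                    "kegg.third.level": third,
                    "kegg.fourth.level": m["description"],
                    "path": path,
                    "KO": m["ko"],
                }

def write_csv(rows, handle):
    """Write the rows as CSV with the column order of the desired table."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# A few lines copied from the sample file, used here as test input.
sample = """+D  KO
!
A<b>Metabolism</b>
B
B  <b>Carbohydrate metabolism</b>
C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D      MYGENEACCESSION01; K01623  ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]
C    00020 Citrate cycle (TCA cycle) [PATH:ko00020]
C    00051 Fructose and mannose metabolism [PATH:ko00051]
D      MYGENEACCESSION03; K17497  PMM; phosphomannomutase [EC:5.4.2.8]
!
"""

rows = list(parse_keg(sample.splitlines()))
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

For a real file, the same two functions could be used as `with open("myfile.keg") as src, open("annotation.csv", "w", newline="") as dst: write_csv(parse_keg(src), dst)`; the resulting CSV opens directly in Excel. Pathways without any gene (such as 00020 above) produce no rows, which matches the manually built table.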

0 answers:

No answers