I have a GTF file of genes that I am trying to parse so that 'gene_id', 'gene_type', 'gene_status', 'gene_name', and 'level' each end up in a separate column.
So for my original file:
chr1 | ENSEMBL gene| 17369| 17436| . - . |gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1 | ENSEMBL gene| 30366| 30503| . + . |gene_id "ENSG00000274890.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; level 3;
chr1 | ENSEMBL gene| 157784| 157887| . - . |gene_id "ENSG00000222623.1"; gene_type "snRNA"; gene_status "KNOWN"; gene_name "RNU6-1100P"; level 3;
chr1 | ENSEMBL gene| 187891| 187958| . - . |gene_id "ENSG00000273874.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-2"; level 3;
I would like it to look like this, with 'gene_id', 'gene_type', 'gene_status', 'gene_name', and 'level' all in SEPARATE columns:
chr1 |ENSEMBL |gene| 17369| 17436 |. - . |gene_id "ENSG00000278267.1" |gene_type "miRNA" |gene_status "KNOWN" |gene_name "MIR6859-1" |level 3
chr1 |ENSEMBL |gene| 30366| 30503 |. + . |gene_id "ENSG00000274890.1" |gene_type "miRNA" |gene_status "KNOWN" |gene_name "MIR1302-2" |level 3
chr1 |ENSEMBL |gene| 157784| 157887 |. - . |gene_id "ENSG00000222623.1" |gene_type "snRNA" |gene_status "KNOWN" |gene_name "RNU6-1100P" |level 3
chr1 |ENSEMBL |gene| 187891| 187958 |. - . |gene_id "ENSG00000273874.1" |gene_type "miRNA" |gene_status "KNOWN" |gene_name "MIR6859-2" |level 3
I tried parsing it with gffutils, using the basic code from their documentation:
import gffutils
db = gffutils.create_db("sRNA.gene.gtf", dbfn='sRNA.gene.gtf.db')
print(list(db.featuretypes()))
# Here's how to write genes out to file
with open('sRNA.gene.gtf', 'w') as fout:
    for gene in db.features_of_type('gene'):
        fout.write(str(gene) + '\n')
However, I get an "ImportError: cannot import name 'feature'":
ImportError Traceback (most recent call last)
<ipython-input-26-4dd7cd5c7e24> in <module>()
2
3
----> 4 db = gffutils.create_db("sRNA.gene.gtf", dbfn='sRNA.gene.gtf.db')
5
6 #db = gffutils.FeatureDB('sRNA.gene.gtf.db')
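I am not sure what is going wrong here, and I am now considering parsing the file from the command line instead. Can anyone suggest the best way to approach parsing a GTF file?
Thank you in advance.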
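Answer 0 (score: 0)
You want to replace the multiple delimiters in the GTF file with a single tab delimiter. Once that is done, the file is no longer a GTF file.
The following code converts the contents of the GTF file into a text file: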
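import gffutils

try:
    # Build the database; this raises if sample.db already exists.
    gffutils.create_db("sample.gtf", dbfn='sample.db')
except Exception:
    pass  # database was created on an earlier run, so reuse it

db = gffutils.FeatureDB('sample.db', keep_order=True)
with open('sample.txt', 'w') as fout:
    for feature in db.all_features():
        # Split the attributes on ";" and rejoin with tabs; make your parsing changes here
        fields = [f.strip() for f in str(feature).split(";") if f.strip()]
        fout.write('\t'.join(fields) + '\n')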
Note that you can only call create_db() once for a given database file. That is why I originally commented it out.
Edit:
Added the try statement.
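If gffutils will not import at all, as in the question, the splitting can also be sketched with the standard library alone. This is only a minimal illustration, not gffutils' own parsing logic: the names split_gtf_line, FIXED_COLUMNS, and ATTRIBUTE_RE are made up for this example, and it assumes the file is tab-separated with nine columns and that attributes look like key "value"; or key value; (as with level 3;).

```python
import re

# Names for the eight fixed GTF columns that precede the attribute column
# (hypothetical labels chosen for this sketch).
FIXED_COLUMNS = ["seqname", "source", "feature", "start", "end",
                 "score", "strand", "frame"]

# Matches `key "value";` as well as unquoted values such as `level 3;`.
ATTRIBUTE_RE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;?')


def split_gtf_line(line):
    """Return (fixed_fields, attributes) for one tab-separated GTF line."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(parts))
    fixed = dict(zip(FIXED_COLUMNS, parts[:8]))
    attributes = {}
    for key, quoted, bare in ATTRIBUTE_RE.findall(parts[8]):
        # Exactly one of `quoted` / `bare` is filled in by the regex.
        attributes[key] = quoted if bare == "" else bare
    return fixed, attributes


# First record from the question, written out with explicit tabs.
line = ('chr1\tENSEMBL\tgene\t17369\t17436\t.\t-\t.\t'
        'gene_id "ENSG00000278267.1"; gene_type "miRNA"; '
        'gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;')
fixed, attributes = split_gtf_line(line)
# One tab-separated row: fixed columns first, then each attribute value.
print("\t".join(list(fixed.values()) + list(attributes.values())))
```

Keeping the attributes in a dict rather than splitting blindly on ";" means a value that itself contains a semicolon inside quotes is not cut in two.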
Answer 1 (score: 0)
You can use the pyranges library to parse the GTF/GFF file; each entry in the attribute column then becomes its own column.
Installation:
# pip install pyranges
# or
# conda install -c bioconda pyranges
Example file:
# !head ensembl.gtf
# #!genome-build GRCh38.p10
# #!genome-version GRCh38
# #!genome-date 2013-12
# #!genome-build-accession NCBI:GCA_000001405.25
# #!genebuild-last-updated 2017-06
# 1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
# 1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
# 1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
# 1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";
# 1 havana exon 13221 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";
Using pyranges:
import pyranges as pr
# as PyRanges-object
gr = pr.read_gtf("ensembl.gtf")
# +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------------+----------------+------------------------------------+-----------------+----------------------+-------+
# | Chromosome | Source | Feature | Start | End | Score | Strand | Frame | gene_id | gene_version | gene_name | gene_source | gene_biotype | transcript_id | transcript_version | +13 |
# | (category) | (object) | (category) | (int32) | (int32) | (object) | (category) | (object) | (object) | (object) | (object) | (object) | (object) | (object) | (object) | ... |
# |--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------------+----------------+------------------------------------+-----------------+----------------------+-------|
# | 1 | havana | gene | 11869 | 14409 | . | + | . | ENSG00000223972 | 5 | DDX11L1 | havana | transcribed_unprocessed_pseudogene | nan | nan | ... |
# | 1 | havana | transcript | 11869 | 14409 | . | + | . | ENSG00000223972 | 5 | DDX11L1 | havana | transcribed_unprocessed_pseudogene | ENST00000456328 | 2 | ... |
# | 1 | havana | exon | 11869 | 12227 | . | + | . | ENSG00000223972 | 5 | DDX11L1 | havana | transcribed_unprocessed_pseudogene | ENST00000456328 | 2 | ... |
# | 1 | havana | exon | 12613 | 12721 | . | + | . | ENSG00000223972 | 5 | DDX11L1 | havana | transcribed_unprocessed_pseudogene | ENST00000456328 | 2 | ... |
# | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
# | 1 | ensembl | transcript | 120725 | 133723 | . | - | . | ENSG00000238009 | 6 | AL627309.1 | ensembl_havana | lincRNA | ENST00000610542 | 1 | ... |
# | 1 | ensembl | exon | 133374 | 133723 | . | - | . | ENSG00000238009 | 6 | AL627309.1 | ensembl_havana | lincRNA | ENST00000610542 | 1 | ... |
# | 1 | ensembl | exon | 129055 | 129223 | . | - | . | ENSG00000238009 | 6 | AL627309.1 | ensembl_havana | lincRNA | ENST00000610542 | 1 | ... |
# | 1 | ensembl | exon | 120874 | 120932 | . | - | . | ENSG00000238009 | 6 | AL627309.1 | ensembl_havana | lincRNA | ENST00000610542 | 1 | ... |
# +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------------+----------------+------------------------------------+-----------------+----------------------+-------+
# Stranded PyRanges object has 95 rows and 28 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
# 13 hidden columns: transcript_name, transcript_source, transcript_biotype, tag, transcript_support_level, exon_number, exon_id, exon_version, (assigned, previous, ccds_id, protein_id, protein_version
# as DataFrame
df = gr.df
# Chromosome Source Feature Start End Score Strand Frame gene_id gene_version gene_name ... transcript_biotype tag transcript_support_level exon_number exon_id exon_version (assigned previous ccds_id protein_id protein_version
# 0 1 havana gene 11869 14409 . + . ENSG00000223972 5 DDX11L1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 1 1 havana transcript 11869 14409 . + . ENSG00000223972 5 DDX11L1 ... processed_transcript basic 1 NaN NaN NaN NaN NaN NaN NaN NaN
# 2 1 havana exon 11869 12227 . + . ENSG00000223972 5 DDX11L1 ... processed_transcript basic 1 1 ENSE00002234944 1 NaN NaN NaN NaN NaN
# 3 1 havana exon 12613 12721 . + . ENSG00000223972 5 DDX11L1 ... processed_transcript basic 1 2 ENSE00003582793 1 NaN NaN NaN NaN NaN
# 4 1 havana exon 13221 14409 . + . ENSG00000223972 5 DDX11L1 ... processed_transcript basic 1 3 ENSE00002312635 1 NaN NaN NaN NaN NaN
# .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
# 90 1 havana exon 110953 111357 . - . ENSG00000238009 6 AL627309.1 ... lincRNA NaN 5 3 ENSE00001879696 1 NaN NaN NaN NaN NaN
# 91 1 ensembl transcript 120725 133723 . - . ENSG00000238009 6 AL627309.1 ... lincRNA basic 5 NaN NaN NaN NaN NaN NaN NaN NaN
# 92 1 ensembl exon 133374 133723 . - . ENSG00000238009 6 AL627309.1 ... lincRNA basic 5 1 ENSE00003748456 1 NaN NaN NaN NaN NaN
# 93 1 ensembl exon 129055 129223 . - . ENSG00000238009 6 AL627309.1 ... lincRNA basic 5 2 ENSE00003734824 1 NaN NaN NaN NaN NaN
# 94 1 ensembl exon 120874 120932 . - . ENSG00000238009 6 AL627309.1 ... lincRNA basic 5 3 ENSE00003740919 1 NaN NaN NaN NaN NaN
#
# [95 rows x 28 columns]
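If installing pyranges is not an option, a similar wide layout (one column per attribute key, left empty where a record lacks that attribute) can be approximated with the standard library. This is a rough sketch and not how pyranges works internally: gtf_to_rows and write_wide_table are hypothetical helper names, the sample records are abbreviated from the example file above, coordinates are copied exactly as they appear in the file, and a key that occurs twice in one record keeps only its last value.

```python
import csv
import io
import re

FIXED_COLUMNS = ["Chromosome", "Source", "Feature", "Start", "End",
                 "Score", "Strand", "Frame"]
# Matches `key "value";` and unquoted forms such as `level 3;`.
ATTRIBUTE_RE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;?')


def gtf_to_rows(lines):
    """Parse GTF lines into dicts, skipping comments and blank lines."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 9:
            continue  # not a 9-column GTF record
        row = dict(zip(FIXED_COLUMNS, parts[:8]))
        for key, quoted, bare in ATTRIBUTE_RE.findall(parts[8]):
            row[key] = bare or quoted  # a repeated key keeps its last value
        rows.append(row)
    return rows


def write_wide_table(rows, out):
    """Write one tab-separated column per attribute key seen in any row."""
    columns = list(FIXED_COLUMNS)
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return columns


# Two records abbreviated from the example file, plus one header comment.
sample = [
    '#!genome-build GRCh38.p10\n',
    '1\thavana\tgene\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; gene_name "DDX11L1";\n',
    '1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'gene_name "DDX11L1";\n',
]
buffer = io.StringIO()
write_wide_table(gtf_to_rows(sample), buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with open(path, "w", newline="") as the out argument.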