I want to use the Bioconductor R packages GenomicFeatures and TxDb.Hsapiens.UCSC.hg19.knownGene to get the coordinates of human genes from a list of HGNC gene IDs.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
my_genes <- c("INO80", "NASP", "INO80D", "SMARCA1")
select(txdb, keys = my_genes,
       columns = c("TXCHROM", "TXSTART", "TXEND", "TXSTRAND"),
       keytype = "GENEID")
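However, this does not work because the txdb does not use HGNC identifiers. How can I solve this? I cannot find any key type that supports HGNC, and I am not sure how to match my HGNC IDs to the GENEID values in the txdb.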
Answer 0: (score: 1)
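I am not familiar with TxDb or the key types it accepts and contains.
I can offer an alternative approach using the biomaRt package, which also accepts HGNC symbols.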
library(biomaRt)
my_genes = c("INO80","NASP","INO80D","SMARCA1")
m <- useMart('ensembl', dataset='hsapiens_gene_ensembl') # create a mart object
df <- getBM(mart = m,
            attributes = c('hgnc_symbol', 'description', 'chromosome_name',
                           'start_position', 'end_position', 'strand',
                           'ensembl_gene_id'),
            filters = 'hgnc_symbol', values = my_genes) # df is a data.frame with all the requested info
There are many attributes to choose from; you can list them with:
listAttributes(m) # our current dataset
For more information, see ??biomaRt
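One caveat worth checking: the question works with hg19, whereas the default Ensembl mart serves the current assembly (GRCh38), so the coordinates returned above may not match the TxDb. A minimal sketch, assuming biomaRt's `useEnsembl()` with `GRCh = 37` to query the GRCh37 archive; `m37` and `df_hg19` are names made up here for illustration, and the call needs network access.

```r
library(biomaRt)

my_genes <- c("INO80", "NASP", "INO80D", "SMARCA1")

# Connect to the GRCh37 (hg19) archive instead of the current assembly.
# m37 is an illustrative name, not part of the original answer.
m37 <- useEnsembl(biomart = "ensembl",
                  dataset = "hsapiens_gene_ensembl",
                  GRCh = 37)

# Same query as above, now against GRCh37 coordinates.
df_hg19 <- getBM(mart = m37,
                 attributes = c("hgnc_symbol", "chromosome_name",
                                "start_position", "end_position", "strand"),
                 filters = "hgnc_symbol", values = my_genes)
```

Ensembl reports chromosome names without the "chr" prefix (for example "15" rather than "chr15"), so they need adjusting before comparison with UCSC-style output.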
Hope this helps.
Answer 1: (score: 1)
This fails because the txdb is built around transcripts and does not store (HGNC) gene symbols; it does, however, store Entrez IDs.
So first we need to map the gene symbols to Entrez IDs.
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
myGeneSymbols <- select(org.Hs.eg.db,
                        keys = c("INO80", "NASP", "INO80D", "SMARCA1"),
                        columns = c("SYMBOL", "ENTREZID"),
                        keytype = "SYMBOL")
# SYMBOL ENTREZID
# 1 INO80 54617
# 2 NASP 4678
# 3 INO80D 54891
# 4 SMARCA1 6594
Then we can subset the txdb:
myGeneSymbolsTx <- select(TxDb.Hsapiens.UCSC.hg19.knownGene,
                          keys = myGeneSymbols$ENTREZID,
                          columns = c("GENEID", "TXID", "TXCHROM", "TXSTART", "TXEND"),
                          keytype = "GENEID")
# GENEID TXID TXCHROM TXSTART TXEND
# 1 54617 55599 chr15 41267988 41280172
# 2 54617 55600 chr15 41271079 41408340
# 3 54617 55601 chr15 41271079 41408340
# 4 4678 1229 chr1 46049660 46079853
# 5 4678 1230 chr1 46049660 46081143
# 6 4678 1231 chr1 46049660 46084578
# 7 4678 1232 chr1 46049660 46084578
# 8 4678 1233 chr1 46049660 46084578
# 9 4678 1234 chr1 46067733 46075197
# 10 4678 1235 chr1 46077135 46084578
# 11 54891 12593 chr2 206858445 206950906
# 12 6594 77970 chrX 128580478 128657460
# 13 6594 77971 chrX 128580478 128657460
# 14 6594 77972 chrX 128580740 128657460
# 15 6594 77973 chrX 128580740 128657460
If needed, we can use merge to add the gene symbol to the table:
res <- merge(myGeneSymbols, myGeneSymbolsTx, by.x = "ENTREZID", by.y = "GENEID")
# ENTREZID SYMBOL TXID TXCHROM TXSTART TXEND
# 1 4678 NASP 1229 chr1 46049660 46079853
# 2 4678 NASP 1230 chr1 46049660 46081143
# 3 4678 NASP 1231 chr1 46049660 46084578
# 4 4678 NASP 1232 chr1 46049660 46084578
# 5 4678 NASP 1233 chr1 46049660 46084578
# 6 4678 NASP 1234 chr1 46067733 46075197
# 7 4678 NASP 1235 chr1 46077135 46084578
# 8 54617 INO80 55599 chr15 41267988 41280172
# 9 54617 INO80 55600 chr15 41271079 41408340
# 10 54617 INO80 55601 chr15 41271079 41408340
# 11 54891 INO80D 12593 chr2 206858445 206950906
# 12 6594 SMARCA1 77970 chrX 128580478 128657460
# 13 6594 SMARCA1 77971 chrX 128580478 128657460
# 14 6594 SMARCA1 77972 chrX 128580740 128657460
# 15 6594 SMARCA1 77973 chrX 128580740 128657460
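The question asks for gene coordinates, while the table above has one row per transcript. A minimal sketch of one way to collapse it, assuming a gene's span can be taken as the smallest transcript start to the largest transcript end on each chromosome. The data frame is rebuilt by hand from the output above so the snippet runs on its own; `gene_ranges`, `GENE_START` and `GENE_END` are names introduced here for illustration.

```r
# Transcript-level rows, values copied from the merged table above.
res <- data.frame(
  ENTREZID = c(rep("4678", 7), rep("54617", 3), "54891", rep("6594", 4)),
  SYMBOL   = c(rep("NASP", 7), rep("INO80", 3), "INO80D", rep("SMARCA1", 4)),
  TXCHROM  = c(rep("chr1", 7), rep("chr15", 3), "chr2", rep("chrX", 4)),
  TXSTART  = c(46049660, 46049660, 46049660, 46049660, 46049660,
               46067733, 46077135,
               41267988, 41271079, 41271079,
               206858445,
               128580478, 128580478, 128580740, 128580740),
  TXEND    = c(46079853, 46081143, 46084578, 46084578, 46084578,
               46075197, 46084578,
               41280172, 41408340, 41408340,
               206950906,
               128657460, 128657460, 128657460, 128657460),
  stringsAsFactors = FALSE
)

# Smallest transcript start and largest transcript end per gene and chromosome.
starts <- aggregate(TXSTART ~ ENTREZID + SYMBOL + TXCHROM, data = res, FUN = min)
ends   <- aggregate(TXEND ~ ENTREZID + SYMBOL + TXCHROM, data = res, FUN = max)

# Join the two summaries into one row per gene.
gene_ranges <- merge(starts, ends, by = c("ENTREZID", "SYMBOL", "TXCHROM"))
names(gene_ranges)[names(gene_ranges) == "TXSTART"] <- "GENE_START"
names(gene_ranges)[names(gene_ranges) == "TXEND"]   <- "GENE_END"

print(gene_ranges)
```

Grouping by TXCHROM means a gene annotated on more than one chromosome yields one row per chromosome instead of a single misleading span.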