I have a data frame, df:
Chrom Position Gene.Sym Ref Variant Lbase Rbase
1 chr1 888639 NOC2L T C 888638 888640
2 chr1 889158 NOC2L G C 889157 889159
3 chr1 889159 NOC2L A C 889158 889160
4 chr1 982941 AGRN T C 982940 982942
5 chr1 1888193 KIAA1751 C A 1888192 1888194
6 chr1 3319632 PRDM16 G A 3319631 3319633
I want to populate a new column, df$triplet, with element [6] of the readLines result for a query. For example:
> readLines('http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr20:1888192,1888194')
[1] "<?xml version=\"1.0\" standalone=\"no\"?>"
[2] "<!DOCTYPE DASDNA SYSTEM \"http://www.biodas.org/dtd/dasdna.dtd\">"
[3] "<DASDNA>"
[4] "<SEQUENCE id=\"chr20\" start=\"1888192\" stop=\"1888194\" version=\"1.00\">"
[5] "<DNA length=\"3\">"
[6] "cct"
[7] "</DNA>"
[8] "</SEQUENCE>"
[9] "</DASDNA>"
I want to put "cct" into df, like this:
Chrom Position Gene.Sym Ref.y Variant.y Lbase Rbase triplet
1 chr1 888639 NOC2L T C 888638 888640 cct
2 chr1 889158 NOC2L G C 889157 889159
3 chr1 889159 NOC2L A C 889158 889160
4 chr1 982941 AGRN T C 982940 982942
5 chr1 1888193 KIAA1751 C A 1888192 1888194
6 chr1 3319632 PRDM16 G A 3319631 3319633
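Except that I want to loop over the values in df$Chrom, df$Lbase and df$Rbase so that the whole column gets filled. I know it will be something like the following, but I'm too new to R to figure it out: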
baseurl = 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment='
myurl = paste(baseurl, trip$Chrom, ":", trip$Lbase, ",", trip$Rbase, sep='')
x = readLines(myurl)
Answer 0 (score: 1)
You can use sapply to apply readLines to the vector of URLs you assembled in myurl, and then add the output back to the data frame, for example:
df$dna <- sapply(myurl, function(url) readLines(url)[6])
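The one-liner above depends on the sequence always being on line 6 and on every request succeeding. A slightly more defensive sketch is shown here. The helper names build_das_urls, extract_dna and fetch_triplets are hypothetical, not part of any package, and the reader argument exists only so the logic can be tried offline with a stand-in for readLines. It assumes the response has the layout shown in the question.

```r
# Hypothetical helper: build one DAS URL per row (vectorized over the columns).
# sprintf("%d") avoids positions such as 100000 being rendered as "1e+05".
build_das_urls <- function(chrom, lbase, rbase,
                           baseurl = "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=") {
  sprintf("%s%s:%d,%d", baseurl, as.character(chrom),
          as.integer(lbase), as.integer(rbase))
}

# Hypothetical helper: return the text between the <DNA ...> and </DNA> lines,
# or NA if the response does not have the expected layout.
extract_dna <- function(lines) {
  start <- grep("^<DNA", lines)
  end <- grep("^</DNA>", lines)
  if (length(start) != 1 || length(end) != 1 || end <= start + 1) {
    return(NA_character_)
  }
  paste(trimws(lines[(start + 1):(end - 1)]), collapse = "")
}

# Hypothetical helper: fetch every URL, returning NA for requests that fail.
fetch_triplets <- function(urls, reader = readLines) {
  vapply(urls, function(u) {
    tryCatch(extract_dna(reader(u)), error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)
}
```

With the data frame from the question, this could be used as df$triplet <- fetch_triplets(build_das_urls(df$Chrom, df$Lbase, df$Rbase)).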
Answer 1 (score: 1)
The idiomatic approach is to parse the XML:
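library(XML)
library(stringr)

# Fetch the sequence for row i of trip and return it as a lowercase string.
f <- function(i) {
  x <- trip[i, ]
  segment <- paste0(x$Chrom, ":", x$Lbase, ",", x$Rbase)
  url <- paste0("http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=", segment)
  doc <- xmlInternalTreeParse(url)
  # The <DNA> node's text may carry surrounding whitespace, so keep only the bases.
  str_extract(xmlValue(doc["//DNA"][[1]]), "[a-z]+")
}
trip$triplet <- sapply(seq_len(nrow(trip)), f)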
trip
# Chrom Position Gene.Sym Ref Variant Lbase Rbase triplet
# 1 chr1 888639 NOC2L T C 888638 888640 ctt
# 2 chr1 889158 NOC2L G C 889157 889159 cga
# 3 chr1 889159 NOC2L A C 889158 889160 gaa
# 4 chr1 982941 AGRN T C 982940 982942 ctc
# 5 chr1 1888193 KIAA1751 C A 1888192 1888194 ccg
# 6 chr1 3319632 PRDM16 G A 3319631 3319633 tgc
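If your data frame is large (many rows), this may take a long time, and the server may lock you out. It would be better to download several segments at once and then parse them in R, but I'm not familiar with the API.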
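Short of a true batch download, one way to go easier on the server is to request each distinct segment only once and to pause between requests. The sketch below illustrates that idea. fetch_throttled is a hypothetical helper, fetch_one stands for any single-URL function (such as one built around the XML parsing above), and the one-second default delay is an arbitrary assumption, not a documented limit of the server.

```r
# Hypothetical helper: fetch each distinct URL once, sleeping between requests,
# then map the results back onto the original (possibly repeated) URLs.
fetch_throttled <- function(urls, fetch_one, delay = 1) {
  unique_urls <- unique(urls)
  results <- character(length(unique_urls))
  for (i in seq_along(unique_urls)) {
    if (i > 1) Sys.sleep(delay)  # pause before every request except the first
    results[i] <- tryCatch(fetch_one(unique_urls[i]),
                           error = function(e) NA_character_)
  }
  results[match(urls, unique_urls)]
}
```

Failed requests come back as NA, so they can be retried later without repeating the ones that succeeded.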