Populating a column in R with the results of a dynamic query while looping over a data frame

Asked: 2014-04-25 18:09:07

Tags: r dynamic web dataframe

I have a data frame, df:

  Chrom Position Gene.Sym Ref Variant   Lbase   Rbase
1  chr1   888639    NOC2L     T         C  888638  888640
2  chr1   889158    NOC2L     G         C  889157  889159
3  chr1   889159    NOC2L     A         C  889158  889160
4  chr1   982941     AGRN     T         C  982940  982942
5  chr1  1888193 KIAA1751     C         A 1888192 1888194
6  chr1  3319632   PRDM16     G         A 3319631 3319633

I want to populate a new column, df$triplet, with element [6] of the readLines result for a query. For example:

> readLines('http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr20:1888192,1888194')
[1] "<?xml version=\"1.0\" standalone=\"no\"?>"                                  
[2] "<!DOCTYPE DASDNA SYSTEM \"http://www.biodas.org/dtd/dasdna.dtd\">"          
[3] "<DASDNA>"                                                                   
[4] "<SEQUENCE id=\"chr20\" start=\"1888192\" stop=\"1888194\" version=\"1.00\">"
[5] "<DNA length=\"3\">"                                                         
[6] "cct"                                                                        
[7] "</DNA>"                                                                     
[8] "</SEQUENCE>"                                                                
[9] "</DASDNA>"

I want to put "cct" into df, like this:

  Chrom Position Gene.Sym Ref.y Variant.y   Lbase   Rbase    triplet
1  chr1   888639    NOC2L     T         C  888638  888640    cct
2  chr1   889158    NOC2L     G         C  889157  889159
3  chr1   889159    NOC2L     A         C  889158  889160
4  chr1   982941     AGRN     T         C  982940  982942
5  chr1  1888193 KIAA1751     C         A 1888192 1888194
6  chr1  3319632   PRDM16     G         A 3319631 3319633

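Except that I want to loop over the values in df$Chrom, df$Lbase, and df$Rbase so that the whole column gets populated. I know it will be something like the following, but I'm too much of a noob to figure it out: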

baseurl = 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment='
myurl = paste(baseurl, trip$Chrom, ":", trip$Lbase, ",", trip$Rbase, sep='')
x = readLines(myurl)

2 Answers:

Answer 0 (score: 1)

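You can use sapply to apply readLines to each element of the URL vector you assembled in myurl, and add the output back to the data frame, for example: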

df$dna <- sapply(myurl, function(url) readLines(url)[6])
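To make the one-liner above easier to try, here is a minimal sketch of the same idea. It assumes the data frame is called trip, as in the question's code, and uses a two-row stand-in for it. fetch_triplets and fake_reader are hypothetical names introduced only for illustration; the stub lets the logic run without touching the server.

```r
# Hypothetical stand-in for the first two rows of the question's data frame.
# Integer columns avoid scientific notation (e.g. "1e+05") when pasted into a URL.
trip <- data.frame(
  Chrom = c("chr1", "chr1"),
  Lbase = c(888638L, 889157L),
  Rbase = c(888640L, 889159L),
  stringsAsFactors = FALSE
)

baseurl <- "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment="

# paste0() is vectorised, so this yields one URL per row of trip.
myurl <- paste0(baseurl, trip$Chrom, ":", trip$Lbase, ",", trip$Rbase)

# fetch_triplets() is a hypothetical helper. `reader` defaults to readLines
# and can be replaced by a stub so the logic can be checked offline.
fetch_triplets <- function(urls, reader = readLines) {
  vapply(urls, function(u) reader(u)[6], character(1), USE.NAMES = FALSE)
}

# Offline stub mimicking the nine-line response shown in the question.
fake_reader <- function(u) c(rep("", 5), "cct", rep("", 3))

trip$triplet <- fetch_triplets(myurl, reader = fake_reader)
```

Against the real server the call would be trip$triplet <- fetch_triplets(myurl). Compared with sapply, vapply checks that each request yields exactly one string, and USE.NAMES = FALSE keeps the URLs from being attached as names.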

Answer 1 (score: 1)

The idiomatic approach is to parse the XML:

library(XML)
library(stringr)

# Fetch the three-base sequence for row i of trip.
f <- function(i) {
  x       <- trip[i, ]
  segment <- paste0(x$Chrom, ":", x$Lbase, ",", x$Rbase)
  url     <- paste0("http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=", segment)
  doc     <- xmlInternalTreeParse(url)
  # The text of <DNA> includes surrounding whitespace; keep only the bases.
  str_extract(xmlValue(doc["//DNA"][[1]]), "[a-z]+")
}
trip$triplet <- sapply(seq_len(nrow(trip)), f)
trip
#   Chrom Position Gene.Sym Ref Variant   Lbase   Rbase triplet
# 1  chr1   888639    NOC2L   T       C  888638  888640     ctt
# 2  chr1   889158    NOC2L   G       C  889157  889159     cga
# 3  chr1   889159    NOC2L   A       C  889158  889160     gaa
# 4  chr1   982941     AGRN   T       C  982940  982942     ctc
# 5  chr1  1888193 KIAA1751   C       A 1888192 1888194     ccg
# 6  chr1  3319632   PRDM16   G       A 3319631 3319633     tgc

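If your data frame is large (many rows), this may take a long time, and you may get locked out of the server. It would be better to download multiple segments at once and then parse them in R, but I'm not familiar with the API.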
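Short of batching, one way to reduce the risk of being locked out is to pause between requests and to tolerate individual failures. The following is only a sketch: fetch_politely is a hypothetical helper, the one-second default pause is an assumption rather than a documented server limit, and fetch_one stands for whatever function turns one URL into one string (for example, function(u) readLines(u)[6]).

```r
# fetch_politely() is a hypothetical helper, not part of any package.
# fetch_one: function taking one URL and returning a single string.
# pause:     seconds to wait between requests (assumed value, not a known limit).
fetch_politely <- function(urls, fetch_one, pause = 1) {
  out <- rep(NA_character_, length(urls))
  for (i in seq_along(urls)) {
    out[i] <- tryCatch(
      fetch_one(urls[i]),
      error = function(e) NA_character_  # a failed request leaves NA in place
    )
    if (i < length(urls)) Sys.sleep(pause)  # no wait after the last request
  }
  out
}

# Offline check with a stub that fails for one of the inputs.
stub <- function(u) if (grepl("bad", u)) stop("request failed") else "cct"
res <- fetch_politely(c("ok-1", "bad-2", "ok-3"), stub, pause = 0)
```

Rows that come back as NA can be retried later instead of aborting the whole loop.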