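Parsing multiple XML files based on a vector and rbind into a data frame

Time: 2016-12-09 23:55:42

Tags: r xml parsing

With some effort and help from Stack Overflow users, I was able to parse a web page and save it as a data frame. I want to repeat the same operation on multiple XML files and rbind the results. Here is what I tried and got working for a single file: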

library(XML)    
xml.url <- "http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml"
doc <- xmlParse(xml.url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL 
x_t <- t(x) 
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))

The code above works well. Now I am trying to apply a function to do the same thing for multiple XML files:

ERS_ID <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762")

xml_url_test =    as.vector(sprintf("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml",
              ERS_ID))

XML_parser <- function(XML_url){
doc <- xmlParse(XML_url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL 
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
return(x_t)
}

major_test <- sapply(xml_url_test, XML_parser)

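It runs, but it gives me a very long list that is not in the data frame format I produced for a single XML file. Finally, I would also like to add a column to the final data frame holding the ERS number from the ERS_ID vector, something like x_t$ERSid <- ERS_ID inside the function.

Can someone point out what I am missing in the function, and suggest a better way to do this task?

Thanks!

4 answers:

Answer 0 (score: 3)

Your main problem is using sapply() instead of lapply(). The latter always returns a list, while the former tries to simplify the result to a vector or matrix, here a matrix.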

major_test <- lapply(xml_url_test, XML_parser)

Of course, sapply is a wrapper around lapply and can also return a list: sapply(..., simplify=FALSE)

major_test <- sapply(xml_url_test, XML_parser, simplify=FALSE)
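
To make the difference concrete, here is a minimal sketch that needs no network access. The function toy_parser and its columns are made up for illustration; it only stands in for XML_parser by returning a one-row data frame.

```r
# Hypothetical stand-in for XML_parser: returns a one-row data frame
toy_parser <- function(id) {
  data.frame(id = id, n = nchar(id), stringsAsFactors = FALSE)
}
ids <- c("a", "bb", "ccc")

# sapply simplifies the equal-length results into a matrix whose cells are lists
simplified <- sapply(ids, toy_parser)

# lapply keeps one data frame per input, ready for rbind
as_list <- lapply(ids, toy_parser)

dim(simplified)              # rows = columns of each data frame, cols = inputs
is.data.frame(as_list[[1]])  # each list element is still a data frame
```

The matrix form is what shows up as the "very long list" in the question, which is why keeping a plain list of data frames is easier to work with.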

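However, a few other issues came up:

  1. At the start, you are not inserting ERS_ID into the URL with sprintf's %s placeholder, so the same URL is repeated for every ID.
  2. At the end, you do not bind the list of data frames into a single final data frame.
  3. Add the new ERS column inside the function, passing in the ERS_ID vector. When creating the column, also remove the ERS prefix with gsub.

R code (adjusted):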

    XML_parser <- function(eid) {
      XML_url <- as.vector(sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", eid))
      doc <- xmlParse(XML_url)
      x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
      x$UNITS <- NULL 
      x_t <- t(x)
      x_t <- as.data.frame(x_t)
      names(x_t) <- as.matrix(x_t[1, ])
      x_t <- x_t[-1, ]
      x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
      x_t$ERSid <- gsub("ERS", "", eid)         # ADD COL, REMOVE ERS
      x_t <- x_t[, c(ncol(x_t), seq_len(ncol(x_t) - 1))]   # MOVE NEW COL TO FIRST
      return(x_t)
    }
    
    major_test <- lapply(ERS_ID, XML_parser)
    # major_test <- sapply(ERS_ID, XML_parser, simplify=FALSE)
    
    # BIND DATA FRAMES TOGETHER
    finaldf <- do.call(rbind, major_test)
    # RESET ROW NAMES
    row.names(finaldf) <- seq(nrow(finaldf))
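
The URL and binding steps can be checked offline with a small sketch. Here fake_parser and its age column are hypothetical placeholders for the real XML_parser; only the sprintf pattern and the ERS IDs come from the question.

```r
ERS_ID <- c("ERS445758", "ERS445759", "ERS445760")

# sprintf is vectorised: each ID replaces the %s placeholder in turn
urls <- sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", ERS_ID)

# Hypothetical stand-in for XML_parser that does not touch the network
fake_parser <- function(eid) {
  data.frame(ERSid = gsub("ERS", "", eid), age = 30L, stringsAsFactors = FALSE)
}

pieces  <- lapply(ERS_ID, fake_parser)      # list of one-row data frames
finaldf <- do.call(rbind, pieces)           # stack them into one data frame
row.names(finaldf) <- seq(nrow(finaldf))    # reset row names

urls
finaldf
```

Note that do.call(rbind, ...) expects every data frame to have the same column names. If some samples carry different attributes, dplyr::bind_rows is one alternative that fills the missing columns with NA.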
    

Answer 1 (score: 1)

Using xml2 and the tidyverse you can do the following:

require(xml2)
require(purrr)
require(tidyr)

urls <- rep("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml", 2)
identifier <- LETTERS[seq_along(urls)] # Take a unique identifier per url here

parse_attribute <- function(x){
  out <- data.frame(tag = xml_text(xml_find_all(x, "./TAG")),
                    value = xml_text(xml_find_all(x, "./VALUE")),
                    stringsAsFactors = FALSE)
  spread(out, tag, value)
}

doc <- map(urls, read_xml)

out <- doc %>%
  map(xml_find_all, "//SAMPLE_ATTRIBUTE") %>%
  set_names(identifier) %>%
  map_df(parse_attribute, .id="url")

This gives you a 2x36 data.frame. To parse the column types, I suggest using readr::type_convert(out).
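
The reshape and type-conversion steps can be tried on their own with made-up values. This is only a sketch: the data frame long below is invented and merely mimics the tag/value layout produced per sample.

```r
library(tidyr)
library(readr)

# Made-up long-format attributes for two hypothetical samples
long <- data.frame(
  url   = c("A", "A", "B", "B"),
  tag   = c("age", "sex", "age", "sex"),
  value = c("28", "male", "58", "male"),
  stringsAsFactors = FALSE
)

# One row per sample, one column per tag; every value is still a string
wide <- spread(long, tag, value)

# Re-guess column types from the strings (age becomes numeric, sex stays character)
typed <- type_convert(wide)

str(typed)
```

In current tidyr, spread is superseded by pivot_wider, so pivot_wider(long, names_from = tag, values_from = value) should be an equivalent choice for new code.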

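Answer 2 (score: 1)

purrr is really useful here, since you can iterate over a vector of URLs or a list of XML files with map, or over nested elements with at_depth, and simplify the results with the *_df forms and flatten: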

library(tidyverse)
library(xml2)

# be kind, don't call this more times than you need to
x <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762") %>% 
    sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", .) %>% 
    map(read_xml)    # read each URL into a list item

df <- x %>% map(xml_find_all, '//SAMPLE_ATTRIBUTE') %>%     # for each item select nodes
    at_depth(2, as_list) %>%     # convert each (nested) attribute to list
    map_df(map_df, flatten)    # flatten items, collect pages to df, then all to one df

df
## # A tibble: 175 × 3
##                       TAG                               VALUE UNITS
##                     <chr>                               <chr> <chr>
## 1      investigation type                          metagenome  <NA>
## 2            project name                                BMRP  <NA>
## 3     experimental factor                          microbiome  <NA>
## 4             target gene                            16S rRNA  <NA>
## 5      target subfragment                                V1V2  <NA>
## 6             pcr primers                            27F-338R  <NA>
## 7   multiplex identifiers                          TGATACGTCT  <NA>
## 8       sequencing method                      pyrosequencing  <NA>
## 9  sequence quality check                            software  <NA>
## 10          chimera check ChimeraSlayer; Usearch 4.1 database  <NA>
## # ... with 165 more rows

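Answer 3 (score: 0)

You can retrieve multiple IDs with a single REST URL by using a comma-separated list or a range (such as ERS445758-ERS445762), which avoids querying ENA multiple times.

This code puts all 5 samples into one node set, then applies functions using a leading dot in the XPath string so that the query is relative to each sample node.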

ERS_ID <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762")
url <- paste0( "http://www.ebi.ac.uk/ena/data/view/", paste(ERS_ID, collapse=","), "&display=xml")  
doc <- xmlParse(url)
samples <- getNodeSet( doc, "//SAMPLE")
## check the first node
samples[[1]]
## get the sample attribute node set and apply xmlToDataFrame to that  
x <- lapply( lapply(samples, getNodeSet,  ".//SAMPLE_ATTRIBUTE"), xmlToDataFrame)
# labels for bind_rows
names(x) <- sapply(samples, xpathSApply, ".//PRIMARY_ID", xmlValue)  
library(dplyr)
y <- bind_rows(x, .id="sample")

z <- subset(y, TAG %in% c("age","sex","body site","body-mass index") , 1:3)
       sample             TAG         VALUE
15  ERS445758             age            28
16  ERS445758             sex          male
17  ERS445758       body site Sigmoid colon
19  ERS445758 body-mass index    16.9550173
50  ERS445759             age            58
51  ERS445759             sex          male
...

library(tidyr)
z %>% spread( TAG, VALUE)
     sample age     body site body-mass index    sex
1 ERS445758  28 Sigmoid colon      16.9550173   male
2 ERS445759  58 Sigmoid colon     23.22543185   male
3 ERS445760  26 Sigmoid colon     20.76124567 female
4 ERS445761  30 Sigmoid colon               0   male
5 ERS445762  36 Sigmoid colon               0   male
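
The effect of the leading dot can be seen on a tiny inline document. This is a sketch under assumptions: the XML string below is invented for illustration and only borrows the element names from the ENA output.

```r
library(XML)

# Hypothetical two-sample document, made up for illustration only
xml_string <- '<ROOT>
  <SAMPLE><PRIMARY_ID>S1</PRIMARY_ID>
    <SAMPLE_ATTRIBUTE><TAG>age</TAG><VALUE>28</VALUE></SAMPLE_ATTRIBUTE>
    <SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>male</VALUE></SAMPLE_ATTRIBUTE>
  </SAMPLE>
  <SAMPLE><PRIMARY_ID>S2</PRIMARY_ID>
    <SAMPLE_ATTRIBUTE><TAG>age</TAG><VALUE>58</VALUE></SAMPLE_ATTRIBUTE>
  </SAMPLE>
</ROOT>'

doc     <- xmlParse(xml_string, asText = TRUE)
samples <- getNodeSet(doc, "//SAMPLE")

# ".//" starts the search at each SAMPLE node, so counts are per sample
n_relative <- sapply(samples, function(s) length(getNodeSet(s, ".//SAMPLE_ATTRIBUTE")))

# "./" selects direct children of the current node only
ids <- sapply(samples, function(s) xpathSApply(s, "./PRIMARY_ID", xmlValue))

n_relative
ids
```

Without the leading dot, a path starting with // is, as far as I know, evaluated against the whole document even when called on a single node, so every sample would appear to have all attributes.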