R: Converting complex XML to a data frame

Time: 2020-07-29 16:16:25

Tags: r xml

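I have a somewhat complex XML document that I need help converting to a data frame. It looks like this (the actual document is, of course, much larger):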

<?xml version="1.0" encoding="utf-8"?>
<ImportFile xmlns="urn:ImportFile-schema">
  <HEADERVERSION>3.3</HEADERVERSION>
  <MESSAGETYPE>Import</MESSAGETYPE>
  <DESTINATIONORI>IL*****</DESTINATIONORI>
  <SOURCELAB>IL***** </SOURCELAB>
  <SUBMITBYUSERID>123456789</SUBMITBYUSERID>
  <SUBMITDATETIME>2020-07-12T18:31:00</SUBMITDATETIME>
  <SPECIMEN SOURCEID="Yes" CASEID="UNKNOWN" PARTIAL="false">
    <SPECIMENID>1234567</SPECIMENID>
    <SPECIMENCATEGORY>Known</SPECIMENCATEGORY>
    <SPECIMENCOMMENT>4</SPECIMENCOMMENT>
    <LOCUS BATCHID="EXPORT" PARTIALLOCUS="false" KIT="PowerPlex ESI 16">
      <LOCUSNAME>D16S539</LOCUSNAME>
      <READINGBY>Lab</READINGBY>
      <READINGDATETIME>2016-05-23T10:24:00</READINGDATETIME>
      <ALLELE>
        <ALLELEVALUE>9</ALLELEVALUE>
      </ALLELE>
      <ALLELE>
        <ALLELEVALUE>12.3</ALLELEVALUE>
      </ALLELE>
    </LOCUS>
    <LOCUS BATCHID="EXPORT" PARTIALLOCUS="false" KIT="PowerPlex ESI 16">
      <LOCUSNAME>D1S1656</LOCUSNAME>
      <READINGBY>Lab</READINGBY>
      <READINGDATETIME>2016-05-23T10:24:00</READINGDATETIME>
      <ALLELE>
        <ALLELEVALUE>12</ALLELEVALUE>
      </ALLELE>
      <ALLELE>
        <ALLELEVALUE>15</ALLELEVALUE>
      </ALLELE>
    </LOCUS>
  </SPECIMEN>
  <SPECIMEN SOURCEID="Yes" CASEID="UNKNOWN" PARTIAL="false">
    <SPECIMENID>9876543</SPECIMENID>
    <SPECIMENCATEGORY>Known</SPECIMENCATEGORY>
    <SPECIMENCOMMENT>4</SPECIMENCOMMENT>
    <LOCUS BATCHID="EXPORT" PARTIALLOCUS="false" KIT="PowerPlex ESI 16">
      <LOCUSNAME>D16S539</LOCUSNAME>
      <READINGBY>Lab</READINGBY>
      <READINGDATETIME>2016-03-17T08:50:00</READINGDATETIME>
      <ALLELE>
        <ALLELEVALUE>11</ALLELEVALUE>
      </ALLELE>
    </LOCUS>    
    <LOCUS BATCHID="EXPORT" PARTIALLOCUS="false" KIT="PowerPlex ESI 16">
      <LOCUSNAME>D1S1656</LOCUSNAME>
      <READINGBY>Lab</READINGBY>
      <READINGDATETIME>2016-03-17T08:50:00</READINGDATETIME>
      <ALLELE>
        <ALLELEVALUE>14</ALLELEVALUE>
      </ALLELE>
      <ALLELE>
        <ALLELEVALUE>17.3</ALLELEVALUE>
      </ALLELE>
    </LOCUS>
  </SPECIMEN>
</ImportFile> 

In the end, I want each row of the data frame to hold one SPECIMENID and each column to hold one LOCUSNAME, as in this example:

SPECIMENID  D16S539  D1S1656 
1234567     9, 12.3  12, 15 
9876543     11       14, 17.3

I tried the following:

library(XML)  # xmlToDataFrame comes from the XML package
v <- xmlToDataFrame("filename.xml")

But I got this error:

Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c(SPECIMENID = "1234567",  : 
  duplicate subscripts for columns

1 answer:

Answer 0: (score: 0)

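xmlToDataFrame does not seem to cope with repeated nodes. The best option appears to be the xml2 package combined with nested calls to purrr::map. I think you also have to strip the namespace for the XPath expressions to work: the document declares a default namespace, so the nodes carry no prefix that an XPath expression could refer to: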

library(tidyverse)  # attaches purrr, dplyr, tidyr and the %>% pipe
doc <- xml2::read_xml("filename.xml")
# Drop the default namespace (modifies doc in place) so plain XPath names match
xml2::xml_ns_strip(doc)
specimen <- xml2::xml_find_all(doc, "SPECIMEN") %>% purrr::map_dfr(function(x) {
  # One SPECIMENID per SPECIMEN node
  specimenid <- xml2::xml_find_all(x, "./SPECIMENID") %>% xml2::xml_text()
  # Inner map: one row per ALLELEVALUE, labelled with its LOCUSNAME
  locus <- xml2::xml_find_all(x, "./LOCUS") %>% purrr::map_dfr(function(y) {
    locusname <- xml2::xml_find_all(y, "./LOCUSNAME") %>% xml2::xml_text()
    allele <- xml2::xml_find_all(y, "./ALLELE/ALLELEVALUE") %>% xml2::xml_text()
    dplyr::tibble(locusname, allele)
  })
  # specimenid is recycled across the rows of the inner tibble
  dplyr::tibble(specimenid, locus)
}) %>% tidyr::pivot_wider(names_from=locusname, values_from=allele)

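I may not be explaining this well, but basically: to tie each ALLELEVALUE to both its SPECIMENID and its LOCUSNAME, you need nested data frames. The inner map attaches the ALLELEVALUE entries to their LOCUSNAME, and the outer map attaches that inner data frame to the specimen; I don't think you can do both in a single step. pivot_wider then reshapes the whole data frame into the layout you asked for, although the ALLELEVALUE entries that share a locus may not come out as one comma-separated string.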
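If the goal is the exact layout from the question, with several alleles joined as "9, 12.3", one possible approach is to pass a collapsing function to the values_fn argument of pivot_wider. The sketch below is only an illustration under a few assumptions: it uses a trimmed-down inline copy of the question's XML instead of filename.xml, the names xml_string, long and wide are made up for this example, and it needs a tidyr version in which values_fn accepts a single function. It also checks the namespace point from above: before stripping, an unprefixed XPath matches nothing, while the d1 prefix that xml2 assigns to the default namespace does match.

```r
library(xml2)
library(purrr)
library(tibble)
library(tidyr)

# Assumption: a shortened inline copy of the question's XML, for illustration only
xml_string <- '<ImportFile xmlns="urn:ImportFile-schema">
  <SPECIMEN SOURCEID="Yes" CASEID="UNKNOWN" PARTIAL="false">
    <SPECIMENID>1234567</SPECIMENID>
    <LOCUS>
      <LOCUSNAME>D16S539</LOCUSNAME>
      <ALLELE><ALLELEVALUE>9</ALLELEVALUE></ALLELE>
      <ALLELE><ALLELEVALUE>12.3</ALLELEVALUE></ALLELE>
    </LOCUS>
    <LOCUS>
      <LOCUSNAME>D1S1656</LOCUSNAME>
      <ALLELE><ALLELEVALUE>12</ALLELEVALUE></ALLELE>
      <ALLELE><ALLELEVALUE>15</ALLELEVALUE></ALLELE>
    </LOCUS>
  </SPECIMEN>
  <SPECIMEN SOURCEID="Yes" CASEID="UNKNOWN" PARTIAL="false">
    <SPECIMENID>9876543</SPECIMENID>
    <LOCUS>
      <LOCUSNAME>D16S539</LOCUSNAME>
      <ALLELE><ALLELEVALUE>11</ALLELEVALUE></ALLELE>
    </LOCUS>
    <LOCUS>
      <LOCUSNAME>D1S1656</LOCUSNAME>
      <ALLELE><ALLELEVALUE>14</ALLELEVALUE></ALLELE>
      <ALLELE><ALLELEVALUE>17.3</ALLELEVALUE></ALLELE>
    </LOCUS>
  </SPECIMEN>
</ImportFile>'

doc <- read_xml(xml_string)

# With the default namespace in place, an unprefixed name matches no nodes,
# whereas the auto-generated d1 prefix does match
n_plain    <- length(xml_find_all(doc, "//SPECIMEN"))
n_prefixed <- length(xml_find_all(doc, "//d1:SPECIMEN", xml_ns(doc)))

# Same route as the answer: strip the namespace, then use plain XPath
xml_ns_strip(doc)

# Long format: one row per (specimen, locus, allele)
long <- map_dfr(xml_find_all(doc, "SPECIMEN"), function(s) {
  specimenid <- xml_text(xml_find_first(s, "./SPECIMENID"))
  map_dfr(xml_find_all(s, "./LOCUS"), function(l) {
    tibble(
      specimenid = specimenid,
      locusname  = xml_text(xml_find_first(l, "./LOCUSNAME")),
      allele     = xml_text(xml_find_all(l, "./ALLELE/ALLELEVALUE"))
    )
  })
})

# Wide format: join the alleles of each specimen/locus pair into one string
wide <- pivot_wider(
  long,
  names_from  = locusname,
  values_from = allele,
  values_fn   = function(v) paste(v, collapse = ", ")
)

print(wide)
```

Collapsing to a string is convenient for display; if the allele values are needed for further computation, keeping the long table (or list-columns) is probably the better choice.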