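I have a somewhat complex XML document that I need help converting to a data frame. It looks like this (the actual document is obviously much larger):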
<?xml version="1.0" encoding="utf-8"?>
<ImportFile xmlns="urn:ImportFile-schema">
  <HEADERVERSION>3.3</HEADERVERSION>
  <MESSAGETYPE>Import</MESSAGETYPE>
  <DESTINATIONORI>IL*****</DESTINATIONORI>
  <SOURCELAB>IL***** </SOURCELAB>
  <SUBMITBYUSERID>123456789</SUBMITBYUSERID>
  <SUBMITDATETIME>2020-07-12T18:31:00</SUBMITDATETIME>
  <SPECIMEN SOURCEID="Yes" CASEID="UNKNOWN" PARTIAL="false">
    <SPECIMENID>1234567</SPECIMENID>
    <SPECIMENCATEGORY>Known</SPECIMENCATEGORY>
    <SPECIMENCOMMENT>4</SPECIMENCOMMENT>
    <LOCUS BATCHID="EXPORT" PARTIALLOCUS="false" KIT="PowerPlex ESI 16">
      <LOCUSNAME>D16S539</LOCUSNAME>
      <READINGBY>Lab</READINGBY>
      <READINGDATETIME>2016-05-23T10:24:00</READINGDATETIME>
      <ALLELE>
        <ALLELEVALUE>9</ALLELEVALUE>
      </ALLELE>
      <ALLELE>
        <ALLELEVALUE>12.3</ALLELEVALUE>
      </ALLELE>
    </LOCUS>
    <LOCUS BATCHID="EXPORT" PARTIALLOCUS="false" KIT="PowerPlex ESI 16">
      <LOCUSNAME>D1S1656</LOCUSNAME>
      <READINGBY>Lab</READINGBY>
      <READINGDATETIME>2016-05-23T10:24:00</READINGDATETIME>
      <ALLELE>
        <ALLELEVALUE>12</ALLELEVALUE>
      </ALLELE>
      <ALLELE>
        <ALLELEVALUE>15</ALLELEVALUE>
      </ALLELE>
    </LOCUS>
  </SPECIMEN>
  <SPECIMEN SOURCEID="Yes" CASEID="UNKNOWN" PARTIAL="false">
    <SPECIMENID>9876543</SPECIMENID>
    <SPECIMENCATEGORY>Known</SPECIMENCATEGORY>
    <SPECIMENCOMMENT>4</SPECIMENCOMMENT>
    <LOCUS BATCHID="EXPORT" PARTIALLOCUS="false" KIT="PowerPlex ESI 16">
      <LOCUSNAME>D16S539</LOCUSNAME>
      <READINGBY>Lab</READINGBY>
      <READINGDATETIME>2016-03-17T08:50:00</READINGDATETIME>
      <ALLELE>
        <ALLELEVALUE>11</ALLELEVALUE>
      </ALLELE>
    </LOCUS>
    <LOCUS BATCHID="EXPORT" PARTIALLOCUS="false" KIT="PowerPlex ESI 16">
      <LOCUSNAME>D1S1656</LOCUSNAME>
      <READINGBY>Lab</READINGBY>
      <READINGDATETIME>2016-03-17T08:50:00</READINGDATETIME>
      <ALLELE>
        <ALLELEVALUE>14</ALLELEVALUE>
      </ALLELE>
      <ALLELE>
        <ALLELEVALUE>17.3</ALLELEVALUE>
      </ALLELE>
    </LOCUS>
  </SPECIMEN>
</ImportFile>
In the end, I want each row of the data frame to hold one SPECIMENID and each column to hold one LOCUSNAME, as in this example:
SPECIMENID  D16S539  D1S1656
1234567     9, 12.3  12, 15
9876543     11       14, 17.3
I tried the following:
v<-xmlToDataFrame("filename.xml")
but I get this error:
Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c(SPECIMENID = "1234567", :
duplicate subscripts for columns
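Answer 0 (score: 0)
xmlToDataFrame does not seem to cope with repeated nodes. The best option appears to be the xml2 package, combined with nested calls to purrr::map_dfr. For the XPath expressions to work you also have to strip the namespace: the root element declares a default namespace (urn:ImportFile-schema), so unprefixed names such as SPECIMEN would otherwise match nothing: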
library(tidyverse)
doc <- xml2::read_xml("filename.xml")
# Strip the default namespace (modifies doc in place) so plain XPath names match
xml2::xml_ns_strip(doc)
# Outer map: one iteration per SPECIMEN node
specimen <- xml2::xml_find_all(doc, "SPECIMEN") %>% purrr::map_dfr(function(x) {
  specimenid <- xml2::xml_find_all(x, "./SPECIMENID") %>% xml2::xml_text()
  # Inner map: one row per ALLELEVALUE, labelled with its LOCUSNAME
  locus <- xml2::xml_find_all(x, "./LOCUS") %>% purrr::map_dfr(function(y) {
    locusname <- xml2::xml_find_all(y, "./LOCUSNAME") %>% xml2::xml_text()
    allele <- xml2::xml_find_all(y, "./ALLELE/ALLELEVALUE") %>% xml2::xml_text()
    dplyr::tibble(locusname, allele)
  })
  # Attach the specimen id to every row of the inner data frame
  dplyr::tibble(specimenid, locus)
}) %>% tidyr::pivot_wider(names_from = locusname, values_from = allele)
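To see why the namespace has to go, here is a minimal sketch on a tiny inline document (made up for illustration, but with the same default namespace as the file in the question). The variable names `n_before`, `n_prefixed` and `n_after` are mine, and I use `.//SPECIMEN` rather than `SPECIMEN` only so the search does not depend on the context node.

```r
library(xml2)

# Small stand-in for the real file, with the same default namespace
doc <- read_xml('<ImportFile xmlns="urn:ImportFile-schema">
  <SPECIMEN><SPECIMENID>1234567</SPECIMENID></SPECIMEN>
  <SPECIMEN><SPECIMENID>9876543</SPECIMENID></SPECIMEN>
</ImportFile>')

# Unprefixed XPath names only match nodes in no namespace,
# so this search finds nothing while the namespace is present
n_before <- length(xml_find_all(doc, ".//SPECIMEN"))

# Alternative that keeps the namespace: xml_ns() maps the default
# namespace to the prefix d1, which can be used in the XPath
n_prefixed <- length(xml_find_all(doc, ".//d1:SPECIMEN", xml_ns(doc)))

# xml_ns_strip() removes the default namespace and modifies doc in place
xml_ns_strip(doc)
n_after <- length(xml_find_all(doc, ".//SPECIMEN"))

print(c(before = n_before, prefixed = n_prefixed, after = n_after))
```

If you would rather not modify the document, the `d1:` prefix route should work just as well; stripping simply keeps the XPath strings shorter.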
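I may not be explaining this well, but basically, to tie each ALLELEVALUE to both its SPECIMENID and its LOCUSNAME you need nested data frames: the inner map ties the ALLELEVALUE entries to their LOCUSNAME, and the outer map then ties that inner data.frame to the specimen. I don't think you can do both in a single step. pivot_wider then brings the whole data.frame into the form you specified, although it may not list the ALLELEVALUE entries grouped the way your example does.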
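If you want the alleles grouped into one comma-separated string per cell, as in the table in the question, one possible variation is to collapse them with `paste()` before pivoting. This is only a sketch under my own assumptions: it runs on a cut-down inline copy of the sample data, the names `long` and `wide` are made up, and it replaces the nested maps with a single pass over the LOCUS nodes that looks up the parent's SPECIMENID via `../SPECIMENID`.

```r
library(xml2)
library(purrr)
library(dplyr)
library(tidyr)

# Cut-down copy of the sample data from the question
doc <- read_xml('<ImportFile xmlns="urn:ImportFile-schema">
  <SPECIMEN>
    <SPECIMENID>1234567</SPECIMENID>
    <LOCUS><LOCUSNAME>D16S539</LOCUSNAME>
      <ALLELE><ALLELEVALUE>9</ALLELEVALUE></ALLELE>
      <ALLELE><ALLELEVALUE>12.3</ALLELEVALUE></ALLELE>
    </LOCUS>
    <LOCUS><LOCUSNAME>D1S1656</LOCUSNAME>
      <ALLELE><ALLELEVALUE>12</ALLELEVALUE></ALLELE>
      <ALLELE><ALLELEVALUE>15</ALLELEVALUE></ALLELE>
    </LOCUS>
  </SPECIMEN>
  <SPECIMEN>
    <SPECIMENID>9876543</SPECIMENID>
    <LOCUS><LOCUSNAME>D16S539</LOCUSNAME>
      <ALLELE><ALLELEVALUE>11</ALLELEVALUE></ALLELE>
    </LOCUS>
    <LOCUS><LOCUSNAME>D1S1656</LOCUSNAME>
      <ALLELE><ALLELEVALUE>14</ALLELEVALUE></ALLELE>
      <ALLELE><ALLELEVALUE>17.3</ALLELEVALUE></ALLELE>
    </LOCUS>
  </SPECIMEN>
</ImportFile>')
xml_ns_strip(doc)

# One row per LOCUS node; its alleles are joined into a single string
long <- map_dfr(xml_find_all(doc, ".//LOCUS"), function(locus) {
  alleles <- xml_text(xml_find_all(locus, "./ALLELE/ALLELEVALUE"))
  tibble(
    SPECIMENID = xml_text(xml_find_first(locus, "../SPECIMENID")),
    locusname  = xml_text(xml_find_first(locus, "./LOCUSNAME")),
    allele     = paste(alleles, collapse = ", ")
  )
})

# Each (SPECIMENID, locusname) pair is now unique, so the cells are plain strings
wide <- pivot_wider(long, names_from = locusname, values_from = allele)
print(wide)
```

Because every SPECIMENID/LOCUSNAME pair occurs only once after collapsing, pivot_wider has no duplicates to wrap into list-columns. On the real file you would read it with `read_xml("filename.xml")` instead of the inline string.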