如何读取R中的xml文件和data.frame

时间:2016-09-25 13:57:24

标签: r xml

library(XML)
file <-"E:/aaa.xml"
doc = xmlInternalTreeParse(file)
ns=names(xmlNamespace(xmlRoot(doc)))
patient=getNodeSet(doc, path=paste("/", ns, ":tcga_bcr/", ns,":patient", sep=""))
row=xmlToDataFrame(nodes=patient, stringsAsFactors = F)

shared_stage:stage_event有许多子节点,如何将每个子节点都精确为列。

如果节点具有preferred_name,请使用preferred_name作为data.frame列名。

aaa.xml:

<?xml version="1.0" encoding="UTF-8"?>
<brca:tcga_bcr xsi:schemaLocation="http://tcga.nci/bcr/xml/clinical/brca/2.7 http://tcga-data.nci.nih.gov/docs/xsd/BCR/tcga.nci/bcr/xml/clinical/brca/2.7/TCGA_BCR.BRCA_Clinical.xsd" schemaVersion="2.7" xmlns:brca="http://tcga.nci/bcr/xml/clinical/brca/2.7" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:admin="http://tcga.nci/bcr/xml/administration/2.7" xmlns:clin_shared="http://tcga.nci/bcr/xml/clinical/shared/2.7" xmlns:shared="http://tcga.nci/bcr/xml/shared/2.7" xmlns:brca_shared="http://tcga.nci/bcr/xml/clinical/brca/shared/2.7" xmlns:shared_stage="http://tcga.nci/bcr/xml/clinical/shared/stage/2.7" xmlns:brca_nte="http://tcga.nci/bcr/xml/clinical/brca/shared/new_tumor_event/2.7/1.0" xmlns:nte="http://tcga.nci/bcr/xml/clinical/shared/new_tumor_event/2.7" xmlns:follow_up_v2.1="http://tcga.nci/bcr/xml/clinical/brca/followup/2.7/2.1" xmlns:rx="http://tcga.nci/bcr/xml/clinical/pharmaceutical/2.7" xmlns:rad="http://tcga.nci/bcr/xml/clinical/radiation/2.7">
<brca:patient>
    <admin:additional_studies/>
    <clin_shared:tumor_tissue_site preferred_name="submitted_tumor_site" display_order="9999" cde="3427536" cde_ver="2.000" xsd_ver="2.6" tier="2" owner="TSS" procurement_status="Completed" restricted="false" source_system_identifier="175314">Breast</clin_shared:tumor_tissue_site>
    <clin_shared:race_list>
        <clin_shared:race preferred_name="race" display_order="12" cde="2192199" cde_ver="1.000" xsd_ver="1.8" tier="2" owner="TSS" procurement_status="Completed" restricted="false" source_system_identifier="175301">WHITE</clin_shared:race>
    </clin_shared:race_list>
    <shared:bcr_patient_barcode preferred_name="" display_order="9999" cde="2673794" cde_ver="" xsd_ver="1.8" owner="TSS" procurement_status="Completed" restricted="false">TCGA-A2-A0EV</shared:bcr_patient_barcode>
    <shared:tissue_source_site cde="" cde_ver="" xsd_ver="2.4" owner="TSS" procurement_status="Completed" restricted="false">A2</shared:tissue_source_site>
    <shared_stage:stage_event system="AJCC">
        <shared_stage:system_version preferred_name="ajcc_staging_edition" display_order="51" cde="2722309" cde_ver="1.000" xsd_ver="2.6" tier="1" owner="TSS" procurement_status="Completed" restricted="false" source_system_identifier="1080001">6th</shared_stage:system_version>
        <shared_stage:tnm_categories>
            <shared_stage:pathologic_categories>
                <shared_stage:pathologic_T preferred_name="ajcc_tumor_pathologic_pt" display_order="52" cde="3045435" cde_ver="1.000" xsd_ver="2.6" tier="1" owner="TSS" procurement_status="Completed" restricted="false" source_system_identifier="175336">T1c</shared_stage:pathologic_T>
            </shared_stage:pathologic_categories>
        </shared_stage:tnm_categories>
    </shared_stage:stage_event>       
    <rx:drugs/>
    <rad:radiations/>
</brca:patient>
</brca:tcga_bcr>

data.frame

submitted_tumor_site  race  bcr_patient_barcode  ajcc_staging_edition ajcc_tumor_pathologic_pt
Breast                WHITE  TCGA-A2-A0EV            6th               T1c

2 个答案:

答案 0 :(得分:1)

由于您具有嵌套的后代和不同的命名空间,因此请考虑将xpaths运行到每个所需的xml值。然后将它们绑定到一个数据帧中。外部lapply()在具有brca:patient函数的checkpath()个节点上运行,以考虑可能丢失的子节点或后代节点:

patientnum <- 1:length(xpathSApply(doc, "//brca:patient"))

checkpath <- function(xpath){
  val <- ifelse(length(xpath) > 0, xpath[[1]], NA)
}

patientdata <- lapply(patientnum, function(i){
  temp <- c(checkpath(xpathSApply(doc, paste0("//brca:patient[",i,"]/clin_shared:tumor_tissue_site"), xmlValue)),
            checkpath(xpathSApply(doc, paste0("//brca:patient[",i,"]/descendant::clin_shared:race"), xmlValue)),
            checkpath(xpathSApply(doc, paste0("//brca:patient[",i,"]/descendant::shared:bcr_patient_barcode"), xmlValue)),
            checkpath(xpathSApply(doc, paste0("//brca:patient[",i,"]/descendant::shared_stage:system_version"), xmlValue)),
            checkpath(xpathSApply(doc, paste0("//brca:patient[",i,"]/descendant::shared_stage:pathologic_T"), xmlValue)))

  temp <- setNames(temp, c("tumor_tissue_site", "race", "bcr_patient_barcode", "system_version", "pathologic_T"))
})

patients <- do.call(rbind, patientdata)
patients <- data.frame(patients, stringsAsFactors = FALSE)

或者,您仍然可以使用xmlToDataFrame(),但需要展平和简化XML,这可以使用XSLT(XML转换语言和兄弟到XPath)来完成。

虽然R没有用于XSLT的专用通用库,但您可以使用外部处理器,包括其他语言(Python,Java,PHP,甚至Excel VBA),专用.exe(Saxon,Xalan)或命令行。口译员(PowerShell,Bash)。 R可以使用system()

调用每个人

XSLT 脚本

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
               xmlns:brca="http://tcga.nci/bcr/xml/clinical/brca/2.7"
               xmlns:clin_shared="http://tcga.nci/bcr/xml/clinical/shared/2.7"
               xmlns:shared="http://tcga.nci/bcr/xml/shared/2.7"              
               xmlns:shared_stage="http://tcga.nci/bcr/xml/clinical/shared/stage/2.7">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

  <xsl:template match="/brca:tcga_bcr">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="brca:patient"/>
    </xsl:element>
  </xsl:template>    

  <xsl:template match="brca:patient">    
    <xsl:element name="{local-name()}">
        <tumor_tissue_site><xsl:value-of select="clin_shared:tumor_tissue_site"/></tumor_tissue_site>
        <race><xsl:value-of select="descendant::clin_shared:race"/></race>
        <bcr_patient_barcode><xsl:value-of select="descendant::shared:bcr_patient_barcode"/></bcr_patient_barcode>
        <system_version><xsl:value-of select="descendant::shared_stage:system_version"/></system_version>
        <pathologic_T><xsl:value-of select="descendant::shared_stage:pathologic_T"/></pathologic_T>
    </xsl:element>
  </xsl:template>

</xsl:transform>

R 脚本

system("command line call to transform xml source with xslt")
# system('python "path/to/transformation_script.py"')          ' EXAMPLE: PYTHON SCRIPT

doc <- xmlParse("path/to/transformed.xml")
doc
# <?xml version="1.0" encoding="UTF-8"?>
# <tcga_bcr>
#   <patient>
#     <tumor_tissue_site>Breast</tumor_tissue_site>
#     <race>WHITE</race>
#     <bcr_patient_barcode>TCGA-A2-A0EV</bcr_patient_barcode>
#     <system_version>6th</system_version>
#     <pathologic_T>T1c</pathologic_T>
#   </patient>
# </tcga_bcr>

patients <- xmlToDataFrame(nodes = getNodeSet(doc, "//patient"), stringsAsFactors = FALSE)

答案 1 :(得分:0)

doc = xmlInternalTreeParse(file)
ns=names(xmlNamespace(xmlRoot(doc)))
patient=getNodeSet(doc, path=paste("/", ns, ":tcga_bcr/", ns,":patient", sep=""))

patient.fields=xmlChildren(patient[[1]])
patient.fields[[2]]

结果是

<clin_shared:tumor_tissue_site preferred_name="submitted_tumor_site" display_order="9999" cde="3427536" cde_ver="2.000" xsd_ver="2.6" tier="2" owner="TSS" procurement_status="Completed" restricted="false" source_system_identifier="175314">Breast</clin_shared:tumor_tissue_site> 

如何在patient.fields [[2]]中抽象出preferred_name的内容?