Extracting specific inner nodes from an XML file into an R data frame

Time: 2016-02-03 00:08:16

Tags: r xml-parsing

I have an XML file and want to extract specific nodes from it into R using xmlToDataFrame from the XML package. I can get it to work for a single node. For example:

library(XML)

xml <- xmlParse("file.xml")

df <- xmlToDataFrame(getNodeSet(xml, "//lat"))

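But I would like to know whether it can extract several nodes at once. Specifically, I want a five-column data frame built from these nodes in the XML: //nucleotides, //lat, //lon, //bin_uri and //record_id.

The XML file is structured as follows (only one record is shown, but the file contains many records that I need to extract):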

    <record>
      <record_id>634750</record_id>
      <processid>CCSMA054-07</processid>
      <bin_uri>AAG2098</bin_uri>
      <collection_event>
        <collectors>Arctic Ecology</collectors>
        <coordinates>
          <lat>58.805</lat>
          <lon>-94.214</lon>
        </coordinates>
        <country>Canada</country>
        <province>Manitoba</province>
      </collection_event>
      <sequences>
        <sequence>
          <sequenceID>3336699</sequenceID>
          <markercode>COI-5P</markercode>
          <genbank_accession>HQ938393</genbank_accession>
          <nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides>
        </sequence>
      </sequences>
    </record>

2 Answers:

Answer 0 (score: 1)

Consider simply running one XPath expression per field with xpathSApply(), then binding all the results into a data frame:

library(XML)

doc<-xmlParse("D:/Freelance Work/Scripts/BoldXML.xml")

record_id <- xpathSApply(doc, "//record/record_id", xmlValue)
bin_uri <- xpathSApply(doc, "//record/bin_uri", xmlValue)
lat <- xpathSApply(doc, "//record/collection_event/coordinates/lat", xmlValue)
lon <- xpathSApply(doc, "//record/collection_event/coordinates/lon", xmlValue)
nucleotides <- xpathSApply(doc, "//record/sequences/sequence/nucleotides", xmlValue)

# Column names match the node names requested in the question
df <- data.frame(record_id = unlist(record_id),
                 bin_uri = unlist(bin_uri),
                 lat = unlist(lat),
                 lon = unlist(lon),
                 nucleotides = unlist(nucleotides))
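One caveat with binding document-wide vectors: data.frame() needs every column to have the same length, so a record that lacks, say, its coordinates would break the call or misalign rows. A minimal sketch of a per-record variant follows, assuming that risk applies to your file. The inline two-record sample and the helper first_value() are hypothetical and only there for illustration.

```r
library(XML)

# Hypothetical sample: the second record has no <coordinates> block.
xml_text <- '<records>
  <record>
    <record_id>634750</record_id>
    <bin_uri>AAG2098</bin_uri>
    <collection_event>
      <coordinates><lat>58.805</lat><lon>-94.214</lon></coordinates>
    </collection_event>
    <sequences><sequence><nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides></sequence></sequences>
  </record>
  <record>
    <record_id>634751</record_id>
    <bin_uri>AAG2099</bin_uri>
    <sequences><sequence><nucleotides>ACGT</nucleotides></sequence></sequences>
  </record>
</records>'
doc <- xmlParse(xml_text, asText = TRUE)

# XPaths written relative to a single <record> node.
fields <- c(
  record_id   = "./record_id",
  bin_uri     = "./bin_uri",
  lat         = "./collection_event/coordinates/lat",
  lon         = "./collection_event/coordinates/lon",
  nucleotides = "./sequences/sequence/nucleotides"
)

# Hypothetical helper: text of the first match, or NA if the node is absent.
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Build one single-row data frame per record, then stack them.
records <- getNodeSet(doc, "//record")
rows <- lapply(records, function(rec) {
  values <- vapply(fields, function(p) first_value(rec, p), character(1))
  as.data.frame(as.list(values), stringsAsFactors = FALSE)
})
df <- do.call(rbind, rows)
```

All columns come back as character, so convert lat and lon with as.numeric() if you need numbers. Only the first sequence of each record is kept here; that is a simplification, not something the original data guarantees.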

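Alternatively, you can simplify the original XML with XSLT, a special-purpose language for restructuring XML documents. At the time of this answer R has no general-purpose XSLT package, but virtually every general-purpose language (C#, Java, PHP, Perl, Python, VB) maintains an XSLT library, and you can call such a script from R using system(). Command-line environments such as Windows PowerShell and Bash on Linux can also run XSLT.

XSLT script (save as .xsl or .xslt)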

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <root>
      <xsl:apply-templates select="*"/>
    </root>
  </xsl:template>  

  <xsl:template match="record">
    <xsl:copy>
      <xsl:copy-of select="record_id"/>
      <xsl:copy-of select="bin_uri"/>     
      <xsl:copy-of select="collection_event/coordinates/lat"/>
      <xsl:copy-of select="collection_event/coordinates/lon"/>
      <xsl:copy-of select="sequences/sequence/nucleotides"/>
    </xsl:copy>
  </xsl:template>

</xsl:transform>

XML (after transformation)

<?xml version="1.0" encoding="utf-8"?>
<root>
  <record>
    <record_id>634750</record_id>
    <bin_uri>AAG2098</bin_uri>
    <lat>58.805</lat>
    <lon>-94.214</lon>
    <nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides>
  </record>
</root>

R script:

result <- system('..some command line call to an external script that 
                  parses original xml and above xslt script and transforms
                  former with the latter..', intern = TRUE)

doc <- xmlParse("C:/Path/To/Transformed/XML.xml")
df <- xmlToDataFrame(getNodeSet(doc, "//record"))
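The system() call above is deliberately left as a placeholder. As one possible way to fill it in, here is a rough sketch that assumes the external processor is xsltproc (the command-line tool shipped with libxslt) and that it is on your PATH. The wrapper run_xslt() and all file names are hypothetical; swap in whatever processor and paths you actually use.

```r
library(XML)

# Assumption: xsltproc (from libxslt) is the external XSLT processor.
# run_xslt() is a hypothetical wrapper, not part of any package.
run_xslt <- function(xml_in, xsl, xml_out, exe = "xsltproc") {
  if (!nzchar(Sys.which(exe))) {
    stop("'", exe, "' was not found on the PATH")
  }
  # Equivalent shell command: xsltproc -o <xml_out> <xsl> <xml_in>
  status <- system2(exe, c("-o", shQuote(xml_out), shQuote(xsl), shQuote(xml_in)))
  if (status != 0) {
    stop(exe, " exited with status ", status)
  }
  invisible(xml_out)
}

# Hypothetical file names; the transform only runs if both inputs exist.
if (file.exists("BoldXML.xml") && file.exists("records.xsl")) {
  run_xslt("BoldXML.xml", "records.xsl", "BoldXML_flat.xml")
  doc <- xmlParse("BoldXML_flat.xml")
  df <- xmlToDataFrame(nodes = getNodeSet(doc, "//record"))
}
```

system2() is used instead of system() only because it takes the arguments as a vector, which makes quoting paths with spaces easier.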

Answer 1 (score: 0)

As in the previous answer, getNodeSet with an XPath is another quick way to get the values you need. If each XPath matches a single node, you can use:

library(XML)

doc <- xmlParse("D:/Freelance Work/Scripts/BoldXML.xml")

# getNodeSet() returns a list of nodes, so apply xmlValue() to each element
record_id <- sapply(getNodeSet(doc, "//record/record_id"), xmlValue)
bin_uri   <- sapply(getNodeSet(doc, "//record/bin_uri"), xmlValue)
lat       <- sapply(getNodeSet(doc, "//record/collection_event/coordinates/lat"), xmlValue)
lon       <- sapply(getNodeSet(doc, "//record/collection_event/coordinates/lon"), xmlValue)
ntides    <- sapply(getNodeSet(doc, "//record/sequences/sequence/nucleotides"), xmlValue)
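If the file holds many records, the same getNodeSet calls can be driven from a named vector of XPaths and checked before they are combined. This is only a sketch: the inline sample data is made up, and the length check is my own addition rather than anything the XML package does for you.

```r
library(XML)

# Hypothetical two-record sample in the same shape as the question's file.
xml_text <- '<records>
  <record>
    <record_id>634750</record_id>
    <bin_uri>AAG2098</bin_uri>
    <collection_event>
      <coordinates><lat>58.805</lat><lon>-94.214</lon></coordinates>
    </collection_event>
    <sequences><sequence><nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides></sequence></sequences>
  </record>
  <record>
    <record_id>634751</record_id>
    <bin_uri>AAG2099</bin_uri>
    <collection_event>
      <coordinates><lat>58.771</lat><lon>-93.850</lon></coordinates>
    </collection_event>
    <sequences><sequence><nucleotides>ACGT</nucleotides></sequence></sequences>
  </record>
</records>'
doc <- xmlParse(xml_text, asText = TRUE)

# Column name -> XPath, same expressions as above.
xpaths <- c(
  record_id = "//record/record_id",
  bin_uri   = "//record/bin_uri",
  lat       = "//record/collection_event/coordinates/lat",
  lon       = "//record/collection_event/coordinates/lon",
  ntides    = "//record/sequences/sequence/nucleotides"
)

# One character vector per XPath; vapply() keeps the type stable
# even when an XPath matches nothing.
columns <- lapply(xpaths, function(p) {
  vapply(getNodeSet(doc, p), xmlValue, character(1))
})

# Refuse to build the data frame if the XPaths matched different node counts.
if (length(unique(lengths(columns))) != 1) {
  stop("XPaths matched different numbers of nodes; extract per record instead")
}
df <- as.data.frame(columns, stringsAsFactors = FALSE)
```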