I have an XML file, and I want to use xmlToDataFrame from the XML package to extract specific nodes in R. I can already extract data from a single node. For example:
xml <- xmlParse("file.xml")
df <- xmlToDataFrame(getNodeSet(xml, "//lat"))
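But I would like to know whether it can extract several nodes at once. Specifically, I want to build a five-column data frame from the nodes //nucleotides, //lat, //lon, //bin_uri, and //record_id in the XML.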
The XML file is structured as follows (only one record_id is shown here, but the file contains many records that I need to extract):
<record>
<record_id>634750</record_id>
<processid>CCSMA054-07</processid>
<bin_uri>AAG2098</bin_uri>
<collection_event>
<collectors>Arctic Ecology</collectors>
<coordinates>
<lat>58.805</lat>
<lon>-94.214</lon>
</coordinates>
<country>Canada</country>
<province>Manitoba</province>
</collection_event>
<sequences>
<sequence>
<sequenceID>3336699</sequenceID>
<markercode>COI-5P</markercode>
<genbank_accession>HQ938393</genbank_accession>
<nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides>
</sequence>
</sequences>
</record>
Answer 0 (score: 1)
Consider using xpathSApply() to run each XPath expression separately, then bind all the results into a data frame:
library(XML)
doc <- xmlParse("D:/Freelance Work/Scripts/BoldXML.xml")

# Each call returns one value per matching node, in document order
record_id <- xpathSApply(doc, "//record/record_id", xmlValue)
bin_uri <- xpathSApply(doc, "//record/bin_uri", xmlValue)
lat <- xpathSApply(doc, "//record/collection_event/coordinates/lat", xmlValue)
lon <- xpathSApply(doc, "//record/collection_event/coordinates/lon", xmlValue)
nucleotides <- xpathSApply(doc, "//record/sequences/sequence/nucleotides", xmlValue)

# Assumes every <record> has exactly one of each node, so the vectors line up
df <- data.frame(record_id = unlist(record_id),
                 bin_uri = unlist(bin_uri),
                 lat = unlist(lat),
                 lon = unlist(lon),
                 nucleotides = unlist(nucleotides))
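The column-by-column approach above relies on every record containing each node exactly once; if a record lacks, say, coordinates, the vectors end up with different lengths. A minimal sketch of a per-record variant follows, assuming the XML package evaluates relative XPaths against each record node. The inline sample data, the `paths` vector, and the `first_value` helper are illustrative names, not part of the original answer.

```r
library(XML)

# Hypothetical sample: the second record has no <coordinates> block
xml_text <- '<records>
  <record>
    <record_id>634750</record_id>
    <bin_uri>AAG2098</bin_uri>
    <collection_event><coordinates><lat>58.805</lat><lon>-94.214</lon></coordinates></collection_event>
    <sequences><sequence><nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides></sequence></sequences>
  </record>
  <record>
    <record_id>634751</record_id>
    <bin_uri>AAG2099</bin_uri>
    <sequences><sequence><nucleotides>ACGT</nucleotides></sequence></sequences>
  </record>
</records>'
doc <- xmlParse(xml_text, asText = TRUE)

# XPaths are relative to each <record> node (note the leading ".")
paths <- c(record_id   = "./record_id",
           bin_uri     = "./bin_uri",
           lat         = ".//lat",
           lon         = ".//lon",
           nucleotides = ".//nucleotides")

# Return the first matching value, or NA when the node is missing
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Build one single-row data frame per record, then stack them
records <- getNodeSet(doc, "//record")
rows <- lapply(records, function(rec) {
  as.data.frame(lapply(paths, function(p) first_value(rec, p)),
                stringsAsFactors = FALSE)
})
df <- do.call(rbind, rows)
```

Records with a missing node get NA in that column instead of shifting later values up a row. If a record can hold several sequences, this sketch keeps only the first one.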
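Alternatively, you can simplify the original XML with XSLT, the special-purpose language for restructuring and redesigning XML files. While R does not have a universal XSLT package, practically all general-purpose languages (C#, Java, PHP, Perl, Python, VB) maintain XSLT libraries, and you can call scripts written in them from R using system(). What's more, command-line environments such as Windows' PowerShell and Linux's Bash can run XSLT.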
XSLT script (save as .xsl or .xslt)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<root>
<xsl:apply-templates select="*"/>
</root>
</xsl:template>
<xsl:template match="record">
<xsl:copy>
<xsl:copy-of select="record_id"/>
<xsl:copy-of select="bin_uri"/>
<xsl:copy-of select="collection_event/coordinates/lat"/>
<xsl:copy-of select="collection_event/coordinates/lon"/>
<xsl:copy-of select="sequences/sequence/nucleotides"/>
</xsl:copy>
</xsl:template>
</xsl:transform>
XML (after transformation)
<?xml version="1.0" encoding="utf-8"?>
<root>
<record>
<record_id>634750</record_id>
<bin_uri>AAG2098</bin_uri>
<lat>58.805</lat>
<lon>-94.214</lon>
<nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides>
</record>
</root>
R script:
result <- system('..some command line call to an external script that
parses original xml and above xslt script and transforms
former with the latter..', intern = TRUE)
doc <- xmlParse("C:/Path/To/Transformed/XML.xml")
df <- xmlToDataFrame(getNodeSet(doc, "//record"))
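As one possible way to fill in the command-line step, here is a sketch that assumes the `xsltproc` tool (shipped with libxslt) is installed and on the PATH. The original answer does not name a specific processor, so this is only an illustration, and the three file names are hypothetical.

```r
library(XML)

# Hypothetical file names; replace with your own paths
xml_in  <- "BoldXML.xml"             # original XML
xsl     <- "simplify.xsl"            # the XSLT script shown above
xml_out <- "BoldXML_simplified.xml"  # transformed output

# xsltproc usage: xsltproc -o <output> <stylesheet> <input>
# shQuote() protects paths that contain spaces
args <- c("-o", shQuote(xml_out), shQuote(xsl), shQuote(xml_in))

# Only run when the tool and both input files are actually available
if (nzchar(Sys.which("xsltproc")) && file.exists(xml_in) && file.exists(xsl)) {
  status <- system2("xsltproc", args)
  if (status == 0) {
    doc <- xmlParse(xml_out)
    df <- xmlToDataFrame(getNodeSet(doc, "//record"))
  }
}
```

Because the transformed file keeps only the five wanted children under each record, xmlToDataFrame can map them directly to columns.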
Answer 1 (score: 0)
As in the previous answer, getNodeSet with an XPath is another quick way to get the values you need. If each XPath matches exactly one node, you can use:
library(XML)
doc <- xmlParse("D:/Freelance Work/Scripts/BoldXML.xml")
record_id <- xmlValue(getNodeSet(doc, "//record/record_id"))
bin_uri <- xmlValue(getNodeSet(doc, "//record/bin_uri"))
lat <- xmlValue(getNodeSet(doc, "//record/collection_event/coordinates/lat"))
lon <- xmlValue(getNodeSet(doc, "//record/collection_event/coordinates/lon"))
ntides <- xmlValue(getNodeSet(doc, "//record/sequences/sequence/nucleotides"))
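When the file holds many records, the same getNodeSet idea can be looped over a named vector of XPaths and the results assembled into the five-column data frame the question asks for. This is a sketch under the assumption that every record contains each node exactly once; the inline sample data and the `xpaths` and `cols` names are illustrative.

```r
library(XML)

# Hypothetical two-record sample in the same shape as the question's file
xml_text <- '<records>
  <record>
    <record_id>634750</record_id><bin_uri>AAG2098</bin_uri>
    <collection_event><coordinates><lat>58.805</lat><lon>-94.214</lon></coordinates></collection_event>
    <sequences><sequence><nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides></sequence></sequences>
  </record>
  <record>
    <record_id>634751</record_id><bin_uri>AAG2099</bin_uri>
    <collection_event><coordinates><lat>58.900</lat><lon>-94.100</lon></coordinates></collection_event>
    <sequences><sequence><nucleotides>ACGT</nucleotides></sequence></sequences>
  </record>
</records>'
doc <- xmlParse(xml_text, asText = TRUE)

# One XPath per output column
xpaths <- c(record_id   = "//record/record_id",
            bin_uri     = "//record/bin_uri",
            lat         = "//record/collection_event/coordinates/lat",
            lon         = "//record/collection_event/coordinates/lon",
            nucleotides = "//record/sequences/sequence/nucleotides")

# Extract the text of every matching node for each XPath
cols <- lapply(xpaths, function(p) sapply(getNodeSet(doc, p), xmlValue))

# Guard: all columns must have the same number of values
stopifnot(length(unique(lengths(cols))) == 1)

df <- data.frame(cols, stringsAsFactors = FALSE)

# xmlValue returns text, so convert the coordinates explicitly
df$lat <- as.numeric(df$lat)
df$lon <- as.numeric(df$lon)
```

The length check fails early if some record lacks a node, which is the case where values would otherwise be paired with the wrong record.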