I have the following information in a nested XML file, and I am trying to convert it into a data.frame for analysis and reporting:
<node TEXT="Cost">
<node TEXT="Scale">
<node TEXT="1 - $0 to $100">
</node>
<node TEXT="2 - $100 to $500">
</node>
<node TEXT="3 - $500 to $1000">
</node>
<node TEXT="4 - $1000 to $5000">
</node>
<node TEXT="6 - $5000 +">
</node>
</node>
<node TEXT="Weight">
<node TEXT="1">
</node>
</node>
</node>
I am able to read the XML file and extract a small part of it, as shown below:
file <- '<node TEXT="Cost">
<node TEXT="Scale">
<node TEXT="1 - $0 to $100">
</node>
<node TEXT="2 - $100 to $500">
</node>
<node TEXT="3 - $500 to $1000">
</node>
<node TEXT="4 - $1000 to $5000">
</node>
<node TEXT="6 - $5000 +">
</node>
</node>
<node TEXT="Weight">
<node TEXT="1">
</node>
</node>
</node>
'
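library(xml2)  # provides read_xml() and xml_find_all()
data <- read_xml(file)
# Returns the TEXT attributes of the nodes nested under "Scale"
xml_find_all(data, "//node/node[@TEXT = 'Scale']/node/@TEXT")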
But what I really need is to get it as a data.frame, like this:
Node1 Node2 Node3
"Cost" "Scale" "1 - $0 to $100"
"Cost" "Scale" "2 - $100 to $500"
"Cost" "Scale" "3 - $500 to $1000"
"Cost" "Scale" "4 - $1000 to $5000"
"Cost" "Scale" "6 - $5000 +"
"Cost" "Weight" "1"
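Can anyone point me in the right direction?
Answer 0 (score: 2)
Besides using xslt, you can also simply iterate over the node list. Here we select every node at the third nesting level, then pull the TEXT attribute from the node itself, its parent, and its grandparent (and finally bind the rows together with dplyr).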
library(xml2)
library(dplyr)
# `data` is the document parsed with read_xml() in the question
xml_find_all(data, "//node/node/node") %>% lapply(function(x) {
list(
NODE1=x %>% xml_parent %>% xml_parent %>% xml_attr("TEXT"),
NODE2=x %>% xml_parent %>% xml_attr("TEXT"),
NODE3=x %>% xml_attr("TEXT")
)
}) %>% bind_rows()
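If you would rather not depend on dplyr, the same walk up the tree can be sketched with base R and xml2 alone. This is only a minimal sketch: `text_up` is a hypothetical helper name introduced here, the XML is a shortened copy of the sample from the question, and it assumes every leaf sits exactly three levels deep.

```r
library(xml2)

# Shortened copy of the sample document from the question
data <- read_xml('<node TEXT="Cost">
<node TEXT="Scale">
<node TEXT="1 - $0 to $100"></node>
<node TEXT="2 - $100 to $500"></node>
<node TEXT="3 - $500 to $1000"></node>
<node TEXT="4 - $1000 to $5000"></node>
<node TEXT="6 - $5000 +"></node>
</node>
<node TEXT="Weight">
<node TEXT="1"></node>
</node>
</node>')

# All nodes at the third nesting level
leaves <- xml_find_all(data, "/node/node/node")

# Hypothetical helper: climb `up` levels from a node and return the TEXT
# attribute of the node reached (up = 0 returns the node's own TEXT)
text_up <- function(node, up) {
  for (i in seq_len(up)) node <- xml_parent(node)
  xml_attr(node, "TEXT")
}

# One column per nesting level, named as requested in the question
result <- data.frame(
  Node1 = vapply(leaves, text_up, character(1), up = 2),
  Node2 = vapply(leaves, text_up, character(1), up = 1),
  Node3 = xml_attr(leaves, "TEXT"),
  stringsAsFactors = FALSE
)
print(result)
```

The parents are looked up one leaf at a time on purpose: calling xml_parent() on the whole node set would, as far as I know, return only the distinct parents, so the columns would no longer line up with the leaves.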
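Answer 1 (score: 0)
To reshape XML, I like to use XSLT, a general-purpose XML transformation language. The R package xslt lets you transform XML documents with XSLT from within R.
In this case you can transform the input into an HTML table, which is then easy to parse with rvest: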
library(tidyverse)
library(xml2)  # read_xml(); not attached by library(tidyverse)
library(xslt)
library(rvest)
file <- '<node TEXT="Cost">
<node TEXT="Scale">
<node TEXT="1 - $0 to $100">
</node>
<node TEXT="2 - $100 to $500">
</node>
<node TEXT="3 - $500 to $1000">
</node>
<node TEXT="4 - $1000 to $5000">
</node>
<node TEXT="6 - $5000 +">
</node>
</node>
<node TEXT="Weight">
<node TEXT="1">
</node>
</node>
</node>
'
data <- read_xml(file)
xslt <- '<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">
<html><table><xsl:apply-templates select="node/node/node"/></table></html>
</xsl:template>
<xsl:template match="node">
<tr>
<td><xsl:value-of select="../../@TEXT"/></td>
<td><xsl:value-of select="../@TEXT"/></td>
<td><xsl:value-of select="@TEXT"/></td>
</tr>
</xsl:template>
</xsl:stylesheet>'
style <- read_xml(xslt)
xml_xslt(data, style) %>%
rvest::html_table() %>%
.[[1]]
#> X1 X2 X3
#> 1 Cost Scale 1 - $0 to $100
#> 2 Cost Scale 2 - $100 to $500
#> 3 Cost Scale 3 - $500 to $1000
#> 4 Cost Scale 4 - $1000 to $5000
#> 5 Cost Scale 6 - $5000 +
#> 6 Cost Weight 1
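The columns above come back as X1, X2, X3. One way to get the Node1/Node2/Node3 names from the question is to have the stylesheet emit a header row, since html_table() by default treats a first row made of th cells as the header. The following is a sketch under that assumption, using a shortened copy of the sample XML; the header row is my addition and not part of the original stylesheet.

```r
library(xml2)
library(xslt)
library(rvest)

# Shortened copy of the sample document from the question
data <- read_xml('<node TEXT="Cost">
<node TEXT="Scale">
<node TEXT="1 - $0 to $100"></node>
<node TEXT="2 - $100 to $500"></node>
</node>
<node TEXT="Weight">
<node TEXT="1"></node>
</node>
</node>')

# Same stylesheet as above, plus a <th> header row (added for illustration)
style <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">
<html><table>
<tr><th>Node1</th><th>Node2</th><th>Node3</th></tr>
<xsl:apply-templates select="node/node/node"/>
</table></html>
</xsl:template>
<xsl:template match="node">
<tr>
<td><xsl:value-of select="../../@TEXT"/></td>
<td><xsl:value-of select="../@TEXT"/></td>
<td><xsl:value-of select="@TEXT"/></td>
</tr>
</xsl:template>
</xsl:stylesheet>')

# Transform to an HTML table, then parse the first (only) table
result <- xml_xslt(data, style) %>%
  html_table() %>%
  .[[1]]
print(result)
```

If you prefer to leave the stylesheet untouched, assigning names(result) <- c("Node1", "Node2", "Node3") after parsing should give the same column names.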