R - How do I convert XML to a data frame in R with the correct structure?

Time: 2015-12-14 17:30:47

Tags: xml r

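I want to convert an XML file to a data frame. I found some functions that let me read the XML data, but I cannot get a data frame with the same structure as the original XML file (that is, the structure you get when you open the XML file in Excel).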

Here is my original XML:

<Data>
<Frame timestamp='17/09/2014  20:55:00.902' timecode='75299902' >
<Object type='Taxi' DISTANCE='3037' VOLUME='1668' id='15593' code='0' />
<Object type='Taxi' DISTANCE='3605' VOLUME='931' id='15603' code='4' />
<Object type='Bus' DISTANCE='3563' VOLUME='488' id='15604' code='9' />
<Object type='Taxi' DISTANCE='4942' VOLUME='57' id='15624' code='1' />
<Object type='Taxi' DISTANCE='784' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3301' VOLUME='2041' id='15626' code='42' />
<Object type='Bus' DISTANCE='2040' VOLUME='2945' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
<Frame timestamp='17/09/2014 20:54:59.771' timecode='75299771' >
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Taxi' DISTANCE='789' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3300' VOLUME='2069' id='15626' code='42' />
<Object type='Bus' DISTANCE='2027' VOLUME='2947' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>

The following already gives me the data as a list:

library(XML)

# Parse the XML file into an internal document
data <- xmlTreeParse(file = "c:/R/CL/filename.xml", useInternalNodes = TRUE)
# Convert the parsed document to a nested list
xl <- xmlToList(data)

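Ideally, I would like to get a data frame from this XML data that is identical to the one I get when I open the XML data in Excel. However, when I look at the output of xl, I see that it is organized into Objects and Times. Normally, when I open the XML file in Excel, this information is linked (each object also has columns with the time information).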

Here is the output of xl <- xmlToList(data):

$Frame$Object
     type         DISTANCE         VOLUME        id       code 
"Taxi"    "3037"    "1668"   "15593"       "0" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "3605"   "931" "15603"     "4" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
  "Bus"  "3563"   "488" "15604"     "9" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "2161"  "1592" "15615"    "21" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "4942"    "57" "15624"     "1" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"   "784"    "47" "15625"    "10" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "3301"  "2041" "15626"    "42" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
  "Bus"  "2040"  "2945" "15630"    "27" 


$Frame$Object
  type      DISTANCE      VOLUME      Z 
"Airplane" "2865" "2722"    "0" 

$Frame$Time
                timestamp                  timecode 
"17/09/2014 20:54:59.902"                "75299902"

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "4941"    "51" "15624"     "1" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"   "789"    "47" "15625"    "10" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "3300"  "2069" "15626"    "42" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
  "Bus"  "2027"  "2947" "15630"    "27" 

$Frame$Object
  type      DISTANCE      VOLUME      Z 
"Airplane" "2865" "2722"    "0" 

$Frame$Time
                timestamp                  timecode 
"17/09/2014 20:54:59.771"                "75299771"

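This list contains two table structures: Frame$Object and Frame$Time. I would like to merge them into a single table by repeating the timestamp and timecode columns for each object.

See the desired output below (the same structure you get when you open the XML file in Excel):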

type      DISTANCE  VOLUME  id     code  z  timestamp                timecode
Taxi      3037      1668    15593  0        17/09/2014 20:54:59.902  75299902
Taxi      3605      931     15603  4        17/09/2014 20:54:59.902  75299902
Bus       3563      488     15604  9        17/09/2014 20:54:59.900  75299902
Taxi      4942      57      15624  1        17/09/2014 20:54:59.900  75299902
Taxi      784       47      15625  10       17/09/2014 20:54:59.900  75299902
Taxi      3301      2041    15626  42       17/09/2014 20:54:59.900  75299902
Bus       2040      2945    15630  27       17/09/2014 20:54:59.900  75299902
Airplane  2865      2722                 0  17/09/2014 20:54:59.900  75299902
Taxi      4941      51      15624  1        17/09/2014 20:54:59.771  75299771
Taxi      789       47      15625  10       17/09/2014 20:54:59.771  75299771
Taxi      3300      2069    15626  42       17/09/2014 20:54:59.771  75299771
Bus       2027      2947    15630  27       17/09/2014 20:54:59.771  75299771
Airplane  2865      2722                 0  17/09/2014 20:54:59.771  75299771

Which functions can achieve this? Thanks in advance for your help!

3 answers:

Answer 0 (score: 3)

You can use xml2 and dplyr to do the conversion quickly:

library(xml2)
library(dplyr)

dat <- "<Data>
<Frame timestamp='17/09/2014  20:55:00.902' timecode='75299902' >
<Object type='Taxi' DISTANCE='3037' VOLUME='1668' id='15593' code='0' />
<Object type='Taxi' DISTANCE='3605' VOLUME='931' id='15603' code='4' />
<Object type='Bus' DISTANCE='3563' VOLUME='488' id='15604' code='9' />
<Object type='Taxi' DISTANCE='4942' VOLUME='57' id='15624' code='1' />
<Object type='Taxi' DISTANCE='784' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3301' VOLUME='2041' id='15626' code='42' />
<Object type='Bus' DISTANCE='2040' VOLUME='2945' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
<Frame timestamp='17/09/2014 20:54:59.771' timecode='75299771' >
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Taxi' DISTANCE='789' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3300' VOLUME='2069' id='15626' code='42' />
<Object type='Bus' DISTANCE='2027' VOLUME='2947' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>"

doc <- read_xml(dat)

# bind the data.frames built in the iterator together
bind_rows(lapply(xml_find_all(doc, "//Frame"), function(x) {

  # extract the attributes from the parent tag as a data.frame
  parent <- data.frame(as.list(xml_attrs(x)), stringsAsFactors=FALSE)

  # make a data.frame out of the attributes of the kids
  kids <- bind_rows(lapply(xml_children(x), function(x) as.list(xml_attrs(x))))

  # combine them
  cbind.data.frame(parent, kids, stringsAsFactors=FALSE)

}))

## Source: local data frame [13 x 8]
## 
##                   timestamp timecode     type DISTANCE VOLUME    id  code     Z
##                       (chr)    (chr)    (chr)    (chr)  (chr) (chr) (chr) (chr)
## 1  17/09/2014  20:55:00.902 75299902     Taxi     3037   1668 15593     0    NA
## 2  17/09/2014  20:55:00.902 75299902     Taxi     3605    931 15603     4    NA
## 3  17/09/2014  20:55:00.902 75299902      Bus     3563    488 15604     9    NA
## 4  17/09/2014  20:55:00.902 75299902     Taxi     4942     57 15624     1    NA
## 5  17/09/2014  20:55:00.902 75299902     Taxi      784     47 15625    10    NA
## 6  17/09/2014  20:55:00.902 75299902     Taxi     3301   2041 15626    42    NA
## 7  17/09/2014  20:55:00.902 75299902      Bus     2040   2945 15630    27    NA
## 8  17/09/2014  20:55:00.902 75299902 Airplane     2865   2722    NA    NA     0
## 9   17/09/2014 20:54:59.771 75299771     Taxi     4941     51 15624     1    NA
## 10  17/09/2014 20:54:59.771 75299771     Taxi      789     47 15625    10    NA
## 11  17/09/2014 20:54:59.771 75299771     Taxi     3300   2069 15626    42    NA
## 12  17/09/2014 20:54:59.771 75299771      Bus     2027   2947 15630    27    NA
## 13  17/09/2014 20:54:59.771 75299771 Airplane     2865   2722    NA    NA     0

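Every column comes back as character, so you will need to convert the types as needed.

If you insist on using the XML package, you can do something similar: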
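As a minimal sketch of that type conversion, the example below uses only base R. The small data frame df is a hypothetical stand-in built from a few values in the output above, and the UTC time zone is an assumption, since the source does not say which zone the timestamps are in:

```r
# Hypothetical stand-in for the all-character result (values copied from the output above)
df <- data.frame(
  timestamp = c("17/09/2014  20:55:00.902", "17/09/2014 20:54:59.771"),
  timecode  = c("75299902", "75299771"),
  type      = c("Taxi", "Airplane"),
  DISTANCE  = c("3037", "2865"),
  VOLUME    = c("1668", "2722"),
  id        = c("15593", NA),
  stringsAsFactors = FALSE
)

# type.convert() picks a suitable type per column; as.is = TRUE keeps text as character
df[] <- lapply(df, type.convert, as.is = TRUE)

# Parse the day/month/year timestamps, including fractional seconds (time zone assumed)
df$timestamp <- as.POSIXct(df$timestamp, format = "%d/%m/%Y %H:%M:%OS", tz = "UTC")

str(df)
```

Columns that contain non-numeric text, such as type, stay character, while the numeric-looking columns become integer with NA preserved.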

doc <- xmlParse(dat)

bind_rows(xpathApply(doc, "//Frame", function(x) {
  parent <- data.frame(as.list(xmlAttrs(x)), stringsAsFactors=FALSE)
  kids <- bind_rows(lapply(xmlChildren(x), function(x) as.list(xmlAttrs(x))))
  cbind.data.frame(parent, kids, stringsAsFactors=FALSE)
}))

Answer 1 (score: 0)

Try

data <- xmlParse(file="c:/R/CL/filename.xml")

and then:

sapply(getNodeSet(data, "//Frame/Object[@type]"), xmlValue)

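It should give you a vector with one entry for every Object node that has a type attribute under a Frame node. More information: http://www.w3schools.com/xsl/xpath_syntax.asp

Answer 2 (score: 0)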
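One caveat: the Object elements in the question store their data in attributes rather than in text content, so xmlValue may not return anything useful for them. A minimal sketch of reading an attribute with xmlGetAttr from the XML package instead, using a shortened copy of the question's data (the variable names xml_text, doc and types are illustrative):

```r
library(XML)

# Shortened copy of the question's XML, for illustration only
xml_text <- "<Data>
<Frame timestamp='17/09/2014 20:54:59.771' timecode='75299771'>
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>"
doc <- xmlParse(xml_text, asText = TRUE)

# One value per matching Object node: the content of its type attribute
types <- sapply(getNodeSet(doc, "//Frame/Object[@type]"), xmlGetAttr, "type")
print(types)
```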

Consider an XPath route with the XML library using xpathSApply(), which includes a workaround for retrieving timestamp and timecode for each child and handles missing attributes such as id.
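As a rough sketch of this idea (not the answerer's original code), the following uses the XML package to visit every Object node, read its attributes with NA as the default for missing ones, and copy timestamp and timecode from its parent Frame via xmlParent. It loops with lapply over getNodeSet rather than xpathSApply to keep the row building explicit. The shortened sample data, the column list cols and the variable names are assumptions made for illustration:

```r
library(XML)

# Shortened copy of the question's XML, for illustration only
xml_text <- "<Data>
<Frame timestamp='17/09/2014 20:54:59.771' timecode='75299771'>
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>"
doc <- xmlParse(xml_text, asText = TRUE)

# Attributes expected on Object nodes (assumed from the sample data)
cols <- c("type", "DISTANCE", "VOLUME", "id", "code", "Z")

rows <- lapply(getNodeSet(doc, "//Frame/Object"), function(node) {
  # The node's own attributes, with NA where an attribute is absent
  own <- vapply(cols, function(a) xmlGetAttr(node, a, default = NA_character_),
                character(1))
  # Time information lives on the enclosing Frame element
  frame <- xmlParent(node)
  c(own,
    timestamp = xmlGetAttr(frame, "timestamp", default = NA_character_),
    timecode  = xmlGetAttr(frame, "timecode", default = NA_character_))
})

# Stack the named character vectors into one data frame, one row per Object
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
print(df)
```

All columns are character at this point, so numeric columns still need converting afterwards.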