Parsing an XML file into a data frame using R

Time: 2016-07-28 20:39:38

Tags: r xml xml-parsing rbind

XML data

<HealthData locale="en_US">
 <ExportDate value="2016-06-02 14:05:23 -0400"/>
 <Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
 <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/>
 <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:12:13 -0400" endDate="2014-09-24 15:12:18 -0400" value="15"/>
 <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:17:16 -0400" endDate="2014-09-24 15:17:21 -0400" value="20"/>
</HealthData>

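I am trying to take the sample XML data shown above and load it into a data frame in R. The attribute names of each Record (type, sourceName, unit, endDate, value) should become the column headers, and each Record's attribute values (for example count, 2014-09-24 15:07:11 -0400, 7) should become the values of one row.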

R code

> library(XML)
> doc="\\pathtoXMLfile"
> list <-xpathApply(doc, "//HealthData/Record", xmlAttrs)
> df <- do.call(rbind.data.frame, list)
> str(df)

I am close, but it looks like the values are also being bound into the column headers. If you run View(df), you will see what I mean. How can I use the Record attribute names as the column header names?

Thanks, Ryan

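2 answers:

Answer 0 (score: 1)

Consider xpathSApply() to retrieve the attributes. When every Record carries the same attributes, the result is a character matrix with one column per Record, so transpose it with t() before converting it to a data frame: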

library(XML)

xmlstr <- '<?xml version="1.0" encoding="UTF-8"?>
            <HealthData locale="en_US">
              <ExportDate value="2016-06-02 14:05:23 -0400"/>
              <Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
              <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/>
              <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:12:13 -0400" endDate="2014-09-24 15:12:18 -0400" value="15"/>
              <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:17:16 -0400" endDate="2014-09-24 15:17:21 -0400" value="20"/>
            </HealthData>'

xml <- xmlParse(xmlstr)

recordAttribs <- xpathSApply(doc=xml, path="//HealthData/Record",  xmlAttrs)
df <- data.frame(t(recordAttribs))
df

#                                type              sourceName  unit
# 1 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count
# 2 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count
# 3 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count
#                creationDate                 startDate                   endDate
# 1 2014-10-02 08:30:17 -0400 2014-09-24 15:07:06 -0400 2014-09-24 15:07:11 -0400
# 2 2014-10-02 08:30:17 -0400 2014-09-24 15:12:13 -0400 2014-09-24 15:12:18 -0400
# 3 2014-10-02 08:30:17 -0400 2014-09-24 15:17:16 -0400 2014-09-24 15:17:21 -0400
#   value
# 1     7
# 2    15
# 3    20
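
To see why the t() call is needed, here is a minimal base-R sketch that does not require the XML package. It assumes a hand-written list named attrs (hypothetical, with shortened type values) standing in for the per-Record attribute vectors. sapply() simplifies equal-length vectors the same way xpathSApply() does, placing one Record per column. The final conversion of value to numeric is an optional extra step and not part of the answer above.

```r
# Hypothetical stand-in for the attribute vectors of two Record elements
attrs <- list(
  c(type = "StepCount", unit = "count", value = "7"),
  c(type = "StepCount", unit = "count", value = "15")
)

# sapply() simplifies equal-length vectors into a character matrix:
# attribute names become row names, one column per Record
m <- sapply(attrs, identity)
dim(m)  # 3 rows (attributes) by 2 columns (Records)

# Transposing puts one Record per row, with attribute names as column names
df <- data.frame(t(m), stringsAsFactors = FALSE)
names(df)

# All attributes arrive as text; convert where a number is needed
df$value <- as.numeric(df$value)
```

Passing stringsAsFactors = FALSE keeps the columns as plain character vectors regardless of the R version's default, which makes later conversions such as as.numeric() behave predictably.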

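If some attributes appear on some Record elements but not on others, consider matching against a predetermined list of names and filling in NA for the missing ones. Below are two versions: one using sapply() with a for loop, and one that passes the list of names to sapply() as a second argument: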

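# Full set of attribute names expected on a Record, in the desired column order
recordnames <- c("type", "unit", "sourceName", "device", "sourceVersion", 
                 "creationDate", "startDate", "endDate", "value")

# NOTE: both versions expect recordAttribs to be a list of named character
# vectors, which is what xpathSApply() returns when Records differ in their
# attributes. They are alternatives: run one or the other on that list.

# FOR LOOP VERSION
filled <- sapply(recordAttribs, function(i) {
  for (r in recordnames) {
    if (!(r %in% names(i))) i[r] <- NA  # ADD MISSING ATTRIBUTE AS NA
  }
  i[recordnames]  # REORDER INNER VECTORS
})

# TWO ARGUMENT SAPPLY VERSION (recordnames IS PASSED IN AS r)
filled <- sapply(recordAttribs, function(i, r) {
  i[setdiff(r, names(i))] <- NA  # ADD MISSING ATTRIBUTES AS NA
  i[r]  # REORDER INNER VECTORS
}, recordnames)

df <- data.frame(t(filled))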
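
The NA-filling idea can be tried without any XML input. The sketch below is a more compact base-R variant and not the answer's own code: it relies on the fact that indexing a named vector by a name it lacks yields NA. The list attrs and the vector wanted are hypothetical stand-ins for the parsed attributes and for recordnames.

```r
# Hypothetical stand-in for the list returned when Records differ:
# the second Record carries a device attribute, the first does not
attrs <- list(
  c(type = "StepCount", unit = "count", value = "7"),
  c(type = "StepCount", unit = "count", device = "iPhone", value = "15")
)
wanted <- c("type", "unit", "device", "value")

# Indexing a named vector by a missing name yields NA, so one lookup per
# Record both fills the gaps and puts the attributes in a fixed order
rows <- lapply(attrs, function(a) setNames(a[wanted], wanted))

# Stack the equal-length vectors into a matrix, one Record per row
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
df$device
```

Because every vector in rows has the same names in the same order, do.call(rbind, rows) produces a matrix whose column names are the attribute names, so no transpose is needed here.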

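Answer 1 (score: 1)

Another option is xmlAttrsToDataFrame, which should handle missing attributes. It is an internal function of the XML package, hence the ::: operator, so it may change between package versions. You can also select only the tags that have a specific attribute, such as device: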

XML:::xmlAttrsToDataFrame(xml["//Record"])
XML:::xmlAttrsToDataFrame(xml["//Record[@device]"])