XML data
R code
<HealthData locale="en_US">
<ExportDate value="2016-06-02 14:05:23 -0400"/>
<Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/>
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:12:13 -0400" endDate="2014-09-24 15:12:18 -0400" value="15"/>
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:17:16 -0400" endDate="2014-09-24 15:17:21 -0400" value="20"/>
</HealthData>
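I am trying to take the XML data sample shown above and load it into a data frame in R, with the attribute names of each Record (type, sourceName, unit, endDate, value) as the column headers and each Record's attribute values (for example count, 2014-09-24 15:07:11 -0400, 7) as the values of each row in the data frame.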
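When I run the following:
> library(XML)
> doc="\\pathtoXMLfile"
> list <-xpathApply(doc, "//HealthData/Record", xmlAttrs)
> df <- do.call(rbind.data.frame, list)
> str(df)
it comes close, but it also appears to bind all of the values into the column headers. If you run str(df) or View(df), you will see what I mean. How can I use the Record attribute names as the column header names?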
Thanks, Ryan
Answer 0: (score: 1)
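Consider using xpathSApply() to retrieve the attributes, then t() to transpose the result (a character matrix when every Record carries the same attributes) before converting it to a data frame: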
library(XML)
xmlstr <- '<?xml version="1.0" encoding="UTF-8"?>
<HealthData locale="en_US">
<ExportDate value="2016-06-02 14:05:23 -0400"/>
<Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/>
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:12:13 -0400" endDate="2014-09-24 15:12:18 -0400" value="15"/>
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:17:16 -0400" endDate="2014-09-24 15:17:21 -0400" value="20"/>
</HealthData>'
xml <- xmlParse(xmlstr)
recordAttribs <- xpathSApply(doc=xml, path="//HealthData/Record", xmlAttrs)
df <- data.frame(t(recordAttribs))
df
#                                type              sourceName  unit
# 1 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count
# 2 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count
# 3 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count
#                creationDate                 startDate                   endDate
# 1 2014-10-02 08:30:17 -0400 2014-09-24 15:07:06 -0400 2014-09-24 15:07:11 -0400
# 2 2014-10-02 08:30:17 -0400 2014-09-24 15:12:13 -0400 2014-09-24 15:12:18 -0400
# 3 2014-10-02 08:30:17 -0400 2014-09-24 15:17:16 -0400 2014-09-24 15:17:21 -0400
#   value
# 1     7
# 2    15
# 3    20
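One way to sidestep the header problem described in the question is to bind the attribute vectors with plain rbind() instead of rbind.data.frame(). The sketch below uses base R only; the hand-built list `attrs` is a hypothetical stand-in for what xpathApply(doc, "//HealthData/Record", xmlAttrs) returns (one named character vector per Record), shortened to four attributes for readability.

```r
# Hypothetical stand-in for the list returned by
# xpathApply(doc, "//HealthData/Record", xmlAttrs)
attrs <- list(
  c(type = "HKQuantityTypeIdentifierStepCount", unit = "count",
    endDate = "2014-09-24 15:07:11 -0400", value = "7"),
  c(type = "HKQuantityTypeIdentifierStepCount", unit = "count",
    endDate = "2014-09-24 15:12:18 -0400", value = "15"),
  c(type = "HKQuantityTypeIdentifierStepCount", unit = "count",
    endDate = "2014-09-24 15:17:21 -0400", value = "20")
)

# rbind() stacks the vectors into a character matrix; its column names
# are taken from the names of the first vector
mat <- do.call(rbind, attrs)

# Convert the matrix to a data frame, keeping strings as character
df <- as.data.frame(mat, stringsAsFactors = FALSE)

names(df)
```

This assumes every Record has the same attributes in the same order; rbind() matches by position, not by name, so records with differing attributes need to be aligned first.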
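If an attribute appears in some records but not in others, consider matching against a predetermined list of names and filling in NAs for the missing ones. Below are two versions: one using sapply() with a for loop, and one passing the name list to sapply() as a second argument: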
recordnames <- c("type", "unit", "sourceName", "device", "sourceVersion",
"creationDate", "startDate", "endDate", "value")
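# Both versions expect recordAttribs to be a list of named character vectors
# (what xpathSApply() returns when records differ in their attributes).
# They are alternatives: run one or the other, not both in sequence.

# FOR LOOP VERSION
recordAttribs <- sapply(recordAttribs, function(i) {
  for (r in recordnames) {
    # i[r] is NA, never NULL, when attribute r is absent, so test the names
    if (!(r %in% names(i))) i[r] <- NA
  }
  i <- i[recordnames] # REORDER INNER VECTORS
  return(i)
})

# TWO LIST ARGUMENT SAPPLY
recordAttribs <- sapply(recordAttribs, function(i, r) {
  i[setdiff(r, names(i))] <- NA # ADD MISSING ATTRIBUTES AS NA
  i <- i[r] # REORDER INNER VECTORS
  return(i)
}, recordnames)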
df <- data.frame(t(recordAttribs))
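To make the NA-filling step concrete, here is a small base R sketch on made-up data. The list `records` and its shortened attribute values are hypothetical; they only imitate a case where one Record has a device attribute and the other does not.

```r
# Hypothetical records: only the second one carries a "device" attribute
records <- list(
  c(type = "StepCount", unit = "count", value = "7"),
  c(type = "StepCount", unit = "count", device = "iPhone", value = "15")
)

# Target column set, in the desired order
recordnames <- c("type", "unit", "device", "value")

aligned <- lapply(records, function(a) {
  out <- a[recordnames]      # indexing by an absent name yields NA
  names(out) <- recordnames  # restore the names lost on the NA entries
  out
})

# Every vector now has the same names in the same order, so rbind() is safe
df <- as.data.frame(do.call(rbind, aligned), stringsAsFactors = FALSE)
df
```

The first row ends up with NA in the device column while the second keeps its value, which is the shape the two sapply() versions above aim for.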
Answer 1: (score: 1)
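Another option is xmlAttrsToDataFrame, an unexported function in the XML package (hence the :::), which should handle missing attributes. You can also select only the tags that have a specific attribute, such as device: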
XML:::xmlAttrsToDataFrame(xml["//Record"])
XML:::xmlAttrsToDataFrame(xml["//Record[@device]"])
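Whichever approach builds the data frame, XML attributes arrive as text, so the value and date columns usually need converting before any arithmetic. The following is a minimal sketch under that assumption; the data frame here is typed in by hand from the sample Records rather than parsed, and the choice of UTC for display is arbitrary.

```r
# Hand-typed stand-in for two columns of the parsed data frame
df <- data.frame(
  endDate = c("2014-09-24 15:07:11 -0400",
              "2014-09-24 15:12:18 -0400",
              "2014-09-24 15:17:21 -0400"),
  value = c("7", "15", "20"),
  stringsAsFactors = FALSE
)

# as.character() first, so the conversion is also correct if a column is a factor
df$value <- as.numeric(as.character(df$value))

# %z reads the "-0400" UTC offset; tz only sets the zone used for display
df$endDate <- as.POSIXct(as.character(df$endDate),
                         format = "%Y-%m-%d %H:%M:%S %z", tz = "UTC")

total_steps <- sum(df$value)
str(df)
```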