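Hi, I'm trying to convert the following XML into a data frame in R, but I can't, because some records are missing values.
RecordID 23063 has ActivityCreatedDate, ExpectedInstallDate, and InvoiceTxnDate. However, some of the following nodes don't have all of these elements; RecordID 23321, for example, is missing InvoiceTxnDate.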
<?xml version="1.0" encoding="windows-1252" ?>
<Record>
<RecordID>23063</RecordID>
<ActivityCreatedDate>2018-12-11T19:00:00</ActivityCreatedDate>
<ExpectedInstallDate>2018-12-19T19:00:00</ExpectedInstallDate>
<InvoiceTxnDate>2018-12-13T19:00:00</InvoiceTxnDate>
</Record>
<Record>
<RecordID>23321</RecordID>
<ActivityCreatedDate>2018-10-15T18:00:00</ActivityCreatedDate>
<ExpectedInstallDate>2018-11-14T19:00:00</ExpectedInstallDate>
</Record>
<Record>
<RecordID>23566</RecordID>
<ActivityCreatedDate>2019-01-23T19:00:00</ActivityCreatedDate>
</Record>
<Record>
<RecordID>23217</RecordID>
<ActivityCreatedDate>2018-12-20T19:00:00</ActivityCreatedDate>
<ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate>
<InvoiceTxnDate>2019-01-18T19:00:00</InvoiceTxnDate>
</Record>
<Record>
<RecordID>23325</RecordID>
<ActivityCreatedDate>2018-05-25T18:00:00</ActivityCreatedDate>
<ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate>
</Record>
</end of file>
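I'm currently using xml2. I read the file into a variable with read_xml, then use xml_find_all and trimws to store each column in a list. When I then try to convert the lists to a data frame, it fails because the dimensions don't match (the lists have different lengths).
I'd like to know how to convert the XML above into a data frame like this: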
RecordID ActivityCreatedDate ExpectedInstallDate InvoiceTxnDate
1 23063 2018-12-11T19:00:00 2018-12-19T19:00:00 2018-12-13T19:00:00
2 23321 2018-10-15T18:00:00 2018-11-14T19:00:00 NA
3 23566 2019-01-23T19:00:00 NA NA
4 23217 2018-12-20T19:00:00 2019-01-23T19:00:00 2019-01-18T19:00:00
5 23325 2018-05-25T18:00:00 2019-01-23T19:00:00 NA
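In this case, is there a way to loop over each RecordID and add
<InvoiceTxnDate>NA</InvoiceTxnDate> or <ExpectedInstallDate>NA</ExpectedInstallDate>
when that node is missing? I'd be happy to share the R code I have that works for uniform data. Also, if this question doesn't make sense, please let me know and I'll explain further.
Thanks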
Answer 0 (score: 1)
Have you tried the XML package?
XML::xmlToDataFrame('path to xml file')
> XML::xmlToDataFrame('~/R/test.xml')
RecordID ActivityCreatedDate ExpectedInstallDate InvoiceTxnDate
1 23063 2018-12-11T19:00:00 2018-12-19T19:00:00 2018-12-13T19:00:00
2 23321 2018-10-15T18:00:00 2018-11-14T19:00:00 <NA>
3 23566 2019-01-23T19:00:00 <NA> <NA>
4 23217 2018-12-20T19:00:00 2019-01-23T19:00:00 2019-01-18T19:00:00
5 23325 2018-05-25T18:00:00 2019-01-23T19:00:00 <NA>
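To try this without a file on disk, the same function can be run against an in-memory string. This is only a minimal sketch: the `<Records>` wrapper element is an assumption (xmlToDataFrame needs well-formed XML with a single root element), and just two of the records from the question are reproduced.

```r
library(XML)

# Two records from the question, wrapped in a hypothetical <Records> root
# so that the text is well-formed XML. The second record has no
# ExpectedInstallDate and no InvoiceTxnDate.
xml_string <- "<Records>
  <Record>
    <RecordID>23063</RecordID>
    <ActivityCreatedDate>2018-12-11T19:00:00</ActivityCreatedDate>
    <ExpectedInstallDate>2018-12-19T19:00:00</ExpectedInstallDate>
    <InvoiceTxnDate>2018-12-13T19:00:00</InvoiceTxnDate>
  </Record>
  <Record>
    <RecordID>23566</RecordID>
    <ActivityCreatedDate>2019-01-23T19:00:00</ActivityCreatedDate>
  </Record>
</Records>"

# asText = TRUE tells the parser the argument is XML content, not a file path.
doc <- xmlParse(xml_string, asText = TRUE)

# Each child of the root becomes one row; fields that a record
# does not contain are filled with NA.
df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
print(df)
```

Passing stringsAsFactors = FALSE keeps the columns as character, which is usually easier to convert to dates afterwards.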
If the XML is exactly as shown above, with no root node, you can do the following:
library(xml2)
library(rvest)
library(tidyverse)

## METHOD 1
## add missing root node
## (read_html uses the lenient HTML parser, which lowercases tag names)
read_html('~/R/test.xml') %>%
  html_children() %>%
  as_xml_document(root = 'doc') %>%
  xml_contents() %>%
  xml_contents() %>%
  map_df(function(x) {
    kids <- xml_children(x)
    # as.is = TRUE keeps the values as character instead of factor
    setNames(as.list(type.convert(xml_text(kids), as.is = TRUE)), xml_name(kids))
  })

## METHOD 2
## treating the xml as a list
read_html('~/R/test.xml') %>%
  html_nodes('record') %>%
  as_list() %>%
  lapply(function(x) unlist(x, recursive = FALSE) %>% bind_cols()) %>%
  bind_rows()
## both of the above methods will return the following tibble
# A tibble: 5 x 4
recordid activitycreateddate expectedinstalldate invoicetxndate
<chr> <chr> <chr> <chr>
1 23063 2018-12-11T19:00:00 2018-12-19T19:00:00 2018-12-13T19:00:00
2 23321 2018-10-15T18:00:00 2018-11-14T19:00:00 NA
3 23566 2019-01-23T19:00:00 NA NA
4 23217 2018-12-20T19:00:00 2019-01-23T19:00:00 2019-01-18T19:00:00
5 23325 2018-05-25T18:00:00 2019-01-23T19:00:00 NA
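If you would rather stay with xml2 alone and keep the original element names, another option is to query each field with xml_find_first. Called on a set of record nodes, it returns one result per record, with a missing node where the field is absent, and xml_text turns a missing node into NA. The following is a minimal sketch rather than a drop-in solution: the `<Records>` wrapper name is an assumption, and three records are pasted inline instead of being read from the file.

```r
library(xml2)

# Three rootless records from the question, as a single string.
records_text <- paste0(
  "<Record><RecordID>23063</RecordID>",
  "<ActivityCreatedDate>2018-12-11T19:00:00</ActivityCreatedDate>",
  "<ExpectedInstallDate>2018-12-19T19:00:00</ExpectedInstallDate>",
  "<InvoiceTxnDate>2018-12-13T19:00:00</InvoiceTxnDate></Record>",
  "<Record><RecordID>23321</RecordID>",
  "<ActivityCreatedDate>2018-10-15T18:00:00</ActivityCreatedDate>",
  "<ExpectedInstallDate>2018-11-14T19:00:00</ExpectedInstallDate></Record>",
  "<Record><RecordID>23566</RecordID>",
  "<ActivityCreatedDate>2019-01-23T19:00:00</ActivityCreatedDate></Record>"
)

# Wrap the records in a hypothetical <Records> root so read_xml accepts them.
doc <- read_xml(paste0("<Records>", records_text, "</Records>"))
records <- xml_find_all(doc, ".//Record")

fields <- c("RecordID", "ActivityCreatedDate",
            "ExpectedInstallDate", "InvoiceTxnDate")

# One lookup per field: xml_find_first yields exactly one result per record,
# so every column has the same length and missing fields become NA.
cols <- lapply(fields, function(f) {
  xml_text(xml_find_first(records, paste0("./", f)))
})
names(cols) <- fields

df <- as.data.frame(cols, stringsAsFactors = FALSE)
print(df)
```

Because every column is built from the same set of record nodes, the lengths always agree, which avoids the dimension mismatch described in the question. For the actual file, the XML declaration and the trailing `</end of file>` line would presumably have to be removed from the text before wrapping it.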