R & xml2: Locate an element by a specific text value and store all child values in a data.frame

Date: 2016-05-20 12:27:20

Tags: r xml xpath xml2

I work with regularly updated XML reports, and I would like to process them using R & xml2.

Here's a link to an entire example file. Here is a sample of the XML:

<?xml version="1.0" ?>
<riDetailEnrolleeReport xmlns="http://vo.edge.fm.cms.hhs.gov">
    <includedFileHeader>
        <outboundFileIdentifier>f2e55625-e70e-4f9d-8278-fc5de7c04d47</outboundFileIdentifier>
        <cmsBatchIdentifier>RIP-2015-00096</cmsBatchIdentifier>
        <cmsJobIdentifier>16220</cmsJobIdentifier>
        <snapShotFileName>25032.BACKUP.D03152016T032051.dat</snapShotFileName>
        <snapShotFileHash>20d887c9a71fa920dbb91edc3d171eb64a784dd6</snapShotFileHash>
        <outboundFileGenerationDateTime>2016-03-15T15:20:54</outboundFileGenerationDateTime>
        <interfaceControlReleaseNumber>04.03.01</interfaceControlReleaseNumber>
        <edgeServerVersion>EDGEServer_14.09_01_b0186</edgeServerVersion>
        <edgeServerProcessIdentifier>8</edgeServerProcessIdentifier>
        <outboundFileTypeCode>RIDE</outboundFileTypeCode>
        <edgeServerIdentifier>2800273</edgeServerIdentifier>
        <issuerIdentifier>25032</issuerIdentifier>
    </includedFileHeader>
    <calendarYear>2015</calendarYear>
    <executionType>P</executionType>
    <includedInsuredMemberIdentifier>
        <insuredMemberIdentifier>ARS001</insuredMemberIdentifier>
        <memberMonths>12.13</memberMonths>
        <totalAllowedClaims>1000.00</totalAllowedClaims>
        <totalPaidClaims>100.00</totalPaidClaims>
        <moopAdjustedPaidClaims>100.00</moopAdjustedPaidClaims>
        <cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
        <estimatedRIPayment>0.00</estimatedRIPayment>
        <coinsurancePercentPayments>0.00</coinsurancePercentPayments>
        <includedPlanIdentifier>
            <planIdentifier>25032VA013000101</planIdentifier>
            <includedClaimIdentifier>
                <claimIdentifier>CADULT4SM00101</claimIdentifier>
                <claimPaidAmount>100.00</claimPaidAmount>
                <crossYearClaimIndicator>N</crossYearClaimIndicator>
            </includedClaimIdentifier>
        </includedPlanIdentifier>
    </includedInsuredMemberIdentifier>
    <includedInsuredMemberIdentifier>
        <insuredMemberIdentifier>ARS002</insuredMemberIdentifier>
        <memberMonths>9.17</memberMonths>
        <totalAllowedClaims>0.00</totalAllowedClaims>
        <totalPaidClaims>0.00</totalPaidClaims>
        <moopAdjustedPaidClaims>0.00</moopAdjustedPaidClaims>
        <cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
        <estimatedRIPayment>0.00</estimatedRIPayment>
        <coinsurancePercentPayments>0.00</coinsurancePercentPayments>
        <includedPlanIdentifier>
            <planIdentifier>25032VA013000101</planIdentifier>
            <includedClaimIdentifier>
                <claimIdentifier></claimIdentifier>
                <claimPaidAmount>0</claimPaidAmount>
                <crossYearClaimIndicator>N</crossYearClaimIndicator>
            </includedClaimIdentifier>
        </includedPlanIdentifier>
    </includedInsuredMemberIdentifier>
</riDetailEnrolleeReport>

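I would like to:

  1. Read the XML into R
  2. Locate a specific insuredMemberIdentifier
  3. Extract the planIdentifier and all claimIdentifier data associated with the member ID in (2)
  4. Store all text and values for insuredMemberIdentifier, planIdentifier, claimIdentifier, and claimPaidAmount in a data.frame, with one row per unique claim ID (member ID to claim ID is one-to-many)

So far, I have accomplished 1 and I am in the ballpark on 2:

    library(xml2)

    ## Step 1 ##
    ride <- read_xml("/Users/temp/Desktop/RIDetailEnrolleeReport.xml")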
    
    ## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
    memID <- xml_find_all(ride, "//d1:insuredMemberIdentifier[text()='ARS001']", xml_ns(ride))
    

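[I know that I can use xml_text() to extract the text of an element.]

After the code in Step 2 above, I tried using xml_parent() to locate the parent node of the insuredMemberIdentifier, saving it as a variable, and then repeating Step 2 on that saved node to get the claim information.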

    node <- xml_parent(memID)
    xml_find_all(node, "//d1:claimIdentifier", xml_ns(ride))
    

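But this just results in pulling every claimIdentifier in the whole file.

Any help or information on how to get to step 4 above would be greatly appreciated. Thank you in advance.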

1 Answer:

Answer 0 (score: 0)

Apologies for the delayed response, but for posterity: import the data with xml2 as above, then parse the XML file by ID, following har07's hint.
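For context, the question's second query returned every claim in the file because an XPath starting with `//` is evaluated from the document root, whichever node is passed to `xml_find_all()`. Prefixing it with `.` makes the search relative to that node. A minimal sketch on a trimmed-down inline document (the `doc` string and the claim IDs `C1`/`C2` are made up for illustration and only mimic the report's layout):

```r
library(xml2)

# Hypothetical two-member document, same namespace and nesting as the report
doc <- read_xml('<riDetailEnrolleeReport xmlns="http://vo.edge.fm.cms.hhs.gov">
  <includedInsuredMemberIdentifier>
    <insuredMemberIdentifier>ARS001</insuredMemberIdentifier>
    <includedPlanIdentifier>
      <planIdentifier>P1</planIdentifier>
      <includedClaimIdentifier>
        <claimIdentifier>C1</claimIdentifier>
      </includedClaimIdentifier>
    </includedPlanIdentifier>
  </includedInsuredMemberIdentifier>
  <includedInsuredMemberIdentifier>
    <insuredMemberIdentifier>ARS002</insuredMemberIdentifier>
    <includedPlanIdentifier>
      <planIdentifier>P1</planIdentifier>
      <includedClaimIdentifier>
        <claimIdentifier>C2</claimIdentifier>
      </includedClaimIdentifier>
    </includedPlanIdentifier>
  </includedInsuredMemberIdentifier>
</riDetailEnrolleeReport>')
ns <- xml_ns(doc)  # the default namespace is exposed under the prefix d1

# Parent of the matching id element, i.e. the member's enclosing node
member <- xml_parent(xml_find_first(
  doc, "//d1:insuredMemberIdentifier[text()='ARS001']", ns))

# "//" restarts from the document root: claims of every member
all_claims <- xml_text(xml_find_all(member, "//d1:claimIdentifier", ns))
# ".//" searches only below `member`: claims of ARS001 alone
own_claims <- xml_text(xml_find_all(member, ".//d1:claimIdentifier", ns))
```

Here `all_claims` holds both claim IDs while `own_claims` holds only the one nested under ARS001, which is why the loop below uses the `.//` form throughout.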

# output object to collect all claims
res <- data.frame(
    insuredMemberIdentifier = rep(NA, 1), 
    planIdentifier = NA, 
    claimIdentifier = NA, 
    claimPaidAmount = NA)
# vector of ids of interest
ids <- c('ARS001')
# indexing counter
starti <- 1
# loop through all ids
for (ii in seq_along(ids)) {
    # find the ii-th id (Step 2 of the question, generalized to any id)
    memID <- xml_find_all(x = ride, 
        xpath = paste0("//d1:insuredMemberIdentifier[text()='", ids[ii], "']"),
        ns = xml_ns(ride))
    # parent of the matched id: the enclosing includedInsuredMemberIdentifier node
    node <- xml_parent(memID)
    # per har07's comment, the leading "." keeps each search inside this node
    cid <- xml_find_all(node, ".//d1:claimIdentifier", xml_ns(ride))
    pid <- xml_find_all(node, ".//d1:planIdentifier", xml_ns(ride))
    cpa <- xml_find_all(node, ".//d1:claimPaidAmount", xml_ns(ride))
    # add invalid data handling if necessary
    if (length(cid) != length(cpa)) {
        warning(paste("cid and cpa do not match for", ids[ii]))
        next
    }
    # collect outputs 
    res[seq_along(cid) + starti - 1, ] <- list(
        ids[ii], 
        xml_text(pid),
        xml_text(cid),
        xml_text(cpa))
    # adjust counter to add next id into correct row
    starti <- starti + length(cid)
}
res
#   insuredMemberIdentifier   planIdentifier claimIdentifier claimPaidAmount
# 1                  ARS001 25032VA013000101  CADULT4SM00101          100.00
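One caveat: the loop pairs `pid` with `cid` by position, so it implicitly assumes one plan per member, which holds for the sample. If a member can have several plans, each with its own claims, the lengths no longer line up. A possible variation, sketched under that assumption, starts from each claim node and reads its plan through the `ancestor::` axis. `claims_for_member` is a hypothetical helper name, and the inline document with IDs `M1`, `P1`, `P2`, `C1` to `C3` is invented test data, not part of the report.

```r
library(xml2)

# Hypothetical helper: one row per claim for a single member id.
# Note: `id` is pasted into the XPath, so it must not contain a single quote.
claims_for_member <- function(doc, id) {
  ns <- xml_ns(doc)
  # All claim containers nested under the member with this id
  claims <- xml_find_all(doc, paste0(
    "//d1:includedInsuredMemberIdentifier[d1:insuredMemberIdentifier='", id,
    "']//d1:includedClaimIdentifier"), ns)
  # xml_find_first() returns one result per input node,
  # so the columns always have the same length
  data.frame(
    insuredMemberIdentifier = rep(id, length(claims)),
    planIdentifier = xml_text(xml_find_first(
      claims, "ancestor::d1:includedPlanIdentifier/d1:planIdentifier", ns)),
    claimIdentifier = xml_text(xml_find_first(claims, "d1:claimIdentifier", ns)),
    claimPaidAmount = as.numeric(xml_text(
      xml_find_first(claims, "d1:claimPaidAmount", ns))),
    stringsAsFactors = FALSE
  )
}

# Invented document: one member, two plans, three claims
doc <- read_xml('<riDetailEnrolleeReport xmlns="http://vo.edge.fm.cms.hhs.gov">
  <includedInsuredMemberIdentifier>
    <insuredMemberIdentifier>M1</insuredMemberIdentifier>
    <includedPlanIdentifier>
      <planIdentifier>P1</planIdentifier>
      <includedClaimIdentifier>
        <claimIdentifier>C1</claimIdentifier>
        <claimPaidAmount>100.00</claimPaidAmount>
      </includedClaimIdentifier>
      <includedClaimIdentifier>
        <claimIdentifier>C2</claimIdentifier>
        <claimPaidAmount>25.50</claimPaidAmount>
      </includedClaimIdentifier>
    </includedPlanIdentifier>
    <includedPlanIdentifier>
      <planIdentifier>P2</planIdentifier>
      <includedClaimIdentifier>
        <claimIdentifier>C3</claimIdentifier>
        <claimPaidAmount>0</claimPaidAmount>
      </includedClaimIdentifier>
    </includedPlanIdentifier>
  </includedInsuredMemberIdentifier>
</riDetailEnrolleeReport>')

ids <- c("M1")
res2 <- do.call(rbind, lapply(ids, function(id) claims_for_member(doc, id)))
```

On this test document `res2` has three rows, with C1 and C2 attached to P1 and C3 attached to P2. Binding per-member data frames with `rbind` also avoids the manual row counter, at the cost of building one small data frame per id.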