Import multiple XML files and convert them to a data frame

Time: 2019-12-30 17:52:48

Tags: r xml

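I am trying to create a routine that imports a large number of XML files from a given directory. I will probably have to import more than a thousand XML files at once and convert them into a data frame. I have already written the import routine for a single file: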

require(tidyverse)
require(xml2)
setwd("D:/")
page<- read_xml("base.xml")
ns<- page %>% xml_find_all(".//test:billing")
billing<-xml2::as_list(ns) %>% jsonlite::toJSON() %>% jsonlite::fromJSON()

My sample XML (base.xml):

<?xml version="1.0" encoding="ISO-8859-1" ?>


<test:TASS xmlns="http://www.vvv.com/schemas"  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  xsi:schemaLocation="http://www.vvv.com/schemas http://www.vvv.com/schemas/testV2_02_03.xsd"  xmlns:test="http://www.vvv.com/schemas" >
    <test:house>
                <test:billing>
                    <test:proceduresummary>
                        <test:guidenumber>X2030</test:guidenumber>
                            <test:diagnosis>
                                <test:table>ICD-10</test:table>
                                <test:diagnosiscod>J441</test:diagnosiscod>
                                <test:description>CHRONIC OBSTRUCTIVE PULMONARY DISEASE WITH (ACUTE) EXACERBATION</test:description>
                            </test:diagnosis>
                            <test:procedure>
                                <test:procedure>
                                    <test:description>HOSPITAL</test:description>
                                </test:procedure>
                                <test:amount>12</test:amount>
                            </test:procedure>
                    </test:proceduresummary>
                </test:billing>
                    <test:billing>
                    <test:proceduresummary>
                        <test:guidenumber>Y6055</test:guidenumber>
                            <test:diagnosis>
                                <test:table>ICD-10</test:table>
                                <test:diagnosiscod>I21</test:diagnosiscod>
                                <test:description>ACUTE MYOCARDIAL INFARCTION</test:description>
                            </test:diagnosis>
                            <test:procedure>
                                <test:procedure>
                                    <test:description>HOSPITAL</test:description>
                                </test:procedure>
                                <test:amount>8</test:amount>
                            </test:procedure>
                    </test:proceduresummary>
                </test:billing>
                    <test:billing>
                    <test:proceduresummary>
                        <test:guidenumber>Z9088</test:guidenumber>
                            <test:diagnosis>
                                <test:table>ICD-10</test:table>
                                <test:diagnosiscod>F20</test:diagnosiscod>
                                <test:description>SCHIZOPHRENIA</test:description>
                            </test:diagnosis>
                            <test:procedure>
                                <test:procedure>
                                    <test:description>HOSPITAL</test:description>
                                </test:procedure>
                                <test:amount>1</test:amount>
                            </test:procedure>
                    </test:proceduresummary>
                </test:billing>
    </test:house>
</test:TASS>

Example of the files in the directory ("D:/") that I need to import:

20215_ABFF20.xml
35700_38HY9R.xml
38597_40YY9J.xml
99853_99PP1Z.xml
115341_663QQP.xml

The first step I tried was to identify all the files in the directory ("D:/"), which I did like this:

require(tidyverse)
require(xml2)
setwd("D:/")
files <- list.files(pattern = ".xml$")

How can I import all the XML files at once and convert them into a data frame? (Assume the files all have the same structure.)

1 answer:

Answer 0 (score: 0)

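Simply generalize your single-file process into a user-defined function, then use lapply to build a list of data frames, and finally stack all of them into one data frame.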

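library(xml2)  # provides read_xml() and xml_find_all()

# USER-DEFINED FUNCTION: READ ONE XML FILE AND RETURN ITS BILLING DATA
proc_xml <- function(xml_file) {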
  page <- read_xml(xml_file)
  ns <- xml_find_all(page, ".//test:billing")
  billing <- jsonlite::fromJSON(jsonlite::toJSON(xml2::as_list(ns)))

  return(billing)
}

# BUILD LIST OF BILLING DATA FRAMES
files <- list.files(pattern = "\\.xml$")  # escape the dot and anchor at the end of the name
df_list <- lapply(files, proc_xml)

# CONCATENATE INTO ONE MASTER DATA FRAME
final_df <- dplyr::bind_rows(df_list)
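
With more than a thousand files, two things may be worth adding to the approach above: a record of which file each row came from, and protection against a single unreadable file aborting the whole run. The sketch below is one possible variant, not the method from the answer. It pulls a few fields directly with XPath instead of going through JSON. The helper names parse_billing, safe_parse and billing_xml are hypothetical, the chosen columns are only an assumption about which fields are wanted, and the demo files are throwaway copies modelled on base.xml, written to a temporary directory.

```r
library(xml2)

# Hypothetical helper (not from the original answer): read one file and
# return one row per <test:billing> node with a few selected fields.
parse_billing <- function(xml_file) {
  page  <- read_xml(xml_file)
  ns    <- xml_ns(page)
  nodes <- xml_find_all(page, ".//test:billing", ns)

  # xml_find_first() keeps one result per node, so a field that is absent
  # from a billing node becomes NA instead of shifting the other rows
  field <- function(xpath) xml_text(xml_find_first(nodes, xpath, ns))

  data.frame(
    file         = rep(basename(xml_file), length(nodes)),
    guidenumber  = field(".//test:guidenumber"),
    diagnosiscod = field(".//test:diagnosis/test:diagnosiscod"),
    diagnosis    = field(".//test:diagnosis/test:description"),
    amount       = as.numeric(field(".//test:procedure/test:amount")),
    stringsAsFactors = FALSE
  )
}

# Wrapper so that one unreadable file is reported and skipped
safe_parse <- function(xml_file) {
  tryCatch(parse_billing(xml_file), error = function(e) {
    message("Skipping ", basename(xml_file), ": ", conditionMessage(e))
    NULL
  })
}

# Demo data only: builds a one-billing document shaped like base.xml
billing_xml <- function(guide, code, desc, amount) {
  sprintf(paste0(
    '<test:TASS xmlns:test="http://www.vvv.com/schemas"><test:house>',
    '<test:billing><test:proceduresummary>',
    '<test:guidenumber>%s</test:guidenumber>',
    '<test:diagnosis><test:table>ICD-10</test:table>',
    '<test:diagnosiscod>%s</test:diagnosiscod>',
    '<test:description>%s</test:description></test:diagnosis>',
    '<test:procedure><test:procedure>',
    '<test:description>HOSPITAL</test:description></test:procedure>',
    '<test:amount>%s</test:amount></test:procedure>',
    '</test:proceduresummary></test:billing>',
    '</test:house></test:TASS>'
  ), guide, code, desc, amount)
}

demo_dir <- tempfile("xml_demo")
dir.create(demo_dir)
writeLines(billing_xml("Y6055", "I21", "ACUTE MYOCARDIAL INFARCTION", 8),
           file.path(demo_dir, "20215_ABFF20.xml"))
writeLines(billing_xml("Z9088", "F20", "SCHIZOPHRENIA", 1),
           file.path(demo_dir, "35700_38HY9R.xml"))
writeLines("<unclosed>", file.path(demo_dir, "broken.xml"))  # deliberately invalid

# full.names = TRUE returns complete paths, so setwd() is not needed
files    <- list.files(demo_dir, pattern = "\\.xml$", full.names = TRUE)
df_list  <- Filter(Negate(is.null), lapply(files, safe_parse))
final_df <- do.call(rbind, df_list)
print(final_df)
```

On real data, list.files would point at "D:/" instead of demo_dir. The file column then makes it possible to trace any row back to its source file, and the skipped files are listed in the console messages.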