Reading XML data stored in a CSV file with R

Date: 2015-01-07 21:12:32

Tags: xml r

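This is my first time working with XML data, and I am still trying to understand its semantics. I have data stored in a .csv file whose cells contain XML. The file has more than 100,000 rows, and I want to extract this data into a data frame.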

I read the file with read.csv. A typical cell looks like this:

<Assessment>
  <Key>Demographics</Key>
  <Name>Patient Demographics</Name>
  <AssessmentProvider></AssessmentProvider>
  <AssessmentVersion>1.0</AssessmentVersion>
  <AssessmentDate></AssessmentDate>
  <Questions>
    <Question>
      <Id>1</Id>
      <QuestionType>Label</QuestionType>
      <Answer></Answer>
    </Question>
    <Question>
      <Id>2</Id>
      <Key>PHN</Key>
      <QuestionText>Healthcare # (Required):</QuestionText>
      <QuestionType>Text</QuestionType>
      <Answer>999999999</Answer>
    </Question>
    <Question>
      <Id>3</Id>
      <Key>PHN_Type</Key>
      <QuestionText>Healthcare Number Issued By (Required):</QuestionText>
      <QuestionType>Combo</QuestionType>
      <DataItems>
        <DataItem Value="1" Code="AB">Alberta</DataItem>
        <DataItem Value="2" Code="BC">British Columbia</DataItem>
        <DataItem Value="3" Code="MB">Manitoba</DataItem>
        <DataItem Value="4" Code="NB">New Brunswick</DataItem>
        <DataItem Value="5" Code="NL">Newfoundland and Labrador</DataItem>
        <DataItem Value="6" Code="NT">Northwest Territories</DataItem>
        <DataItem Value="7" Code="NS">Nova Scotia</DataItem>
        <DataItem Value="8" Code="NU">Nunavut</DataItem>
        <DataItem Value="9" Code="ON">Ontario</DataItem>
        <DataItem Value="10" Code="PE">Prince Edward Island</DataItem>
        <DataItem Value="11" Code="QC">Quebec</DataItem>
        <DataItem Value="12" Code="SK">Saskatchewan</DataItem>
        <DataItem Value="13" Code="YT">Yukon</DataItem>
        <DataItem Value="14" Code="OT">Other</DataItem>
        <DataItem Value="15" Code="US">United States</DataItem>
      </DataItems>
      <Answer>1</Answer>
    </Question>
    <Question>
      <Id>4</Id>
      <Key>FirstName</Key>
      <QuestionText>First Name (Required):</QuestionText>
      <QuestionType>Text</QuestionType>
      <Answer>Merkel</Answer>
    </Question>
    <Question>
      <Id>5</Id>
      <Key>LastName</Key>
      <QuestionText>Last Name (Required):</QuestionText>
      <QuestionType>Text</QuestionType>
      <Answer>test</Answer>
    </Question>
    <Question>
      <Id>6</Id>
      <Key>DOB</Key>
      <QuestionText>Date of Birth (Required):</QuestionText>
      <QuestionType>DateTime</QuestionType>
      <Answer>2/4/1999 12:00:00 AM</Answer>
    </Question>
    <Question>
      <Id>6</Id>
      <Key>Gender</Key>
      <QuestionText>Gender:</QuestionText>
      <QuestionType>YesNo</QuestionType>
      <YesButtonText>Female</YesButtonText>
      <NoButtonText>Male</NoButtonText>
      <Answer>0</Answer>
    </Question>
    <Question>
      <Id>7</Id>
      <Key>MailingAddress</Key>
      <QuestionText>Current Mailing Address:</QuestionText>
      <QuestionType>Text</QuestionType>
      <Answer>123 elk place</Answer>
    </Question>
    <Question>
      <Id>8</Id>
      <Key>Province</Key>
      <QuestionText>Province (Required):</QuestionText>
      <QuestionType>Combo</QuestionType>
      <DataItems>
        <DataItem Value="1" Code="AB">Alberta</DataItem>
        <DataItem Value="2" Code="BC">British Columbia</DataItem>
        <DataItem Value="3" Code="MB">Manitoba</DataItem>
        <DataItem Value="4" Code="NB">New Brunswick</DataItem>
        <DataItem Value="5" Code="NL">Newfoundland and Labrador</DataItem>
        <DataItem Value="6" Code="NT">Northwest Territories</DataItem>
        <DataItem Value="7" Code="NS">Nova Scotia</DataItem>
        <DataItem Value="8" Code="NU">Nunavut</DataItem>
        <DataItem Value="9" Code="ON">Ontario</DataItem>
        <DataItem Value="10" Code="PE">Prince Edward Island</DataItem>
        <DataItem Value="11" Code="QC">Quebec</DataItem>
        <DataItem Value="12" Code="SK">Saskatchewan</DataItem>
        <DataItem Value="13" Code="YT">Yukon</DataItem>
        <DataItem Value="14" Code="OT">Other</DataItem>
        <DataItem Value="15" Code="US">United States</DataItem>
      </DataItems>
      <Answer>1</Answer>
    </Question>
    <Question>
      <Id>9</Id>
      <Key>Postal</Key>
      <QuestionText>Postal Code:</QuestionText>
      <QuestionType>Masked</QuestionType>
      <InputMask>L#L #L#</InputMask>
      <UpperCaseAction>True</UpperCaseAction>
      <Answer>T1A2B1</Answer>
    </Question>
    <Question>
      <Id>10</Id>
      <Key>Email</Key>
      <QuestionText>Email Address:</QuestionText>
      <QuestionType>Text</QuestionType>
      <Answer></Answer>
    </Question>
    <Question>
      <Id>11</Id>
      <Key>MobilePhone</Key>
      <QuestionText>Mobile Phone:</QuestionText>
      <QuestionType>Masked</QuestionType>
      <InputMask>###-###-####</InputMask>
      <UpperCaseAction>False</UpperCaseAction>
      <Answer></Answer>
    </Question>
  </Questions>
</Assessment>

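How do I go about building a data frame from here? I am also new to the XML package. (I may edit this question later to make sure I have not posted any sensitive IDs or information online.) Thank you in advance.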

I tried the following code and am still trying to figure it out.

library(XML)

file <- "Z:\\Project\\PIP\\PIP.xml"
xmlfile <- xmlParse(file, useInternalNodes=TRUE)
xmltop <- xmlRoot(xmlfile) # gives content of root
test <- xmlSApply(xmltop[["Worksheet"]][["Table"]], function(x) xmlSApply(x, xmlValue))
test_df <- data.frame(t(test), row.names=NULL)

I used the following code to extract each cell from the csv and append the result to a data frame. The code works:

library(XML)

file1 <- "Kiosk.csv"
csv <- read.csv(file1, header=FALSE, sep=",", stringsAsFactors=FALSE)
# Write the XML held in the first cell to a text file
zz <- file("xml.txt", open="wt", encoding="UTF-8")
sink(zz)
cat(unlist(unclass(csv[1,1])))
sink()
close(zz)
# Parse that file and pull out the Key and Answer of every Question
file <- "xml.txt"
xmlfile <- xmlParse(file)
xmltop <- xmlRoot(xmlfile)
ns1 <- xmlToDataFrame(nodes=getNodeSet(xmltop, "//Assessment/Questions/Question/Answer"))
ns2 <- xmlToDataFrame(nodes=getNodeSet(xmltop, "//Assessment/Questions/Question/Key"))
ns1 <- cbind(ns2, ns1)

1 Answer:

Answer 0 (score: 2)

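This should get you started. You can process the file row by row. The code below does that, extracting Id, Key & Answer from the XML in each row. The result is a list of three-column data frames (well, data tables) that look like this: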

 ##    Id            Key               Answer
 ## 1:  1     Disclaimer                   NA
 ## 2:  2            PHN            999999997
 ## 3:  3       PHN_Type                    1
 ## 4:  4      FirstName                  sal
 ## ...

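You can combine them into one big data frame/data table, process them individually, add fields, and so on. I used data.table because rbindlist with the fill=TRUE argument, which you may need, is a real time saver.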
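As a minimal sketch of that combining step, assuming the per-row results are already available as a list of data tables: the `per_row` list and its values below are made-up stand-ins, and `csv_row` is just an illustrative column name.

```r
library(data.table)

# Toy stand-ins for the per-row results (hypothetical values, not real data).
per_row <- list(
  data.table(Id = c("1", "2"), Key = c("PHN", "FirstName"),
             Answer = c("999999999", "Merkel")),
  data.table(Id = c("1", "2"), Key = c("PHN", "FirstName"),
             Answer = c("999999997", "sal"),
             QuestionType = c("Text", "Text"))
)

# fill = TRUE pads columns that are missing from some tables with NA;
# idcol records which list element (i.e. which CSV row) each line came from.
combined <- rbindlist(per_row, fill = TRUE, idcol = "csv_row")
print(combined)
```

Keeping the row index in its own column makes it possible to group or reshape by CSV row later without losing track of which answers belong together.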

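library(XML)
library(data.table)
library(magrittr)

# Each row of the CSV holds one XML document in its first column (V1)
dat <- read.csv("sample.csv", header=FALSE, stringsAsFactors=FALSE)

# Returns a list with one data.table per CSV row
lapply(dat$V1, function(xml_cell) {

  # Parse the cell and convert every <Question> node to an R list,
  # replacing empty elements (which come back as NULL) with NA
  question_list <-
    xmlParse(xml_cell, asText=TRUE) %>%
    xpathApply("//Question", xmlToList) %>%
    lapply(function(x) {
      x[sapply(x, is.null)] <- NA
      x
    })

  # Keep only the three fields of interest and stack them into one table
  rbindlist(lapply(question_list, "[", c("Id", "Key", "Answer")))

})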
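The code above assumes every Question carries all three elements. In the sample posted in the question, the first Question has no Key at all, so a variant that looks up each child explicitly may be safer. The following is only a sketch under that assumption: `child_text` is a hypothetical helper, and `xml_cell` is a trimmed-down stand-in for one CSV cell.

```r
library(XML)
library(data.table)

# Hypothetical, trimmed-down cell: the first question has no <Key>
# and an empty <Answer>.
xml_cell <- "<Assessment><Questions>
  <Question><Id>1</Id><QuestionType>Label</QuestionType><Answer></Answer></Question>
  <Question><Id>2</Id><Key>PHN</Key><Answer>999999999</Answer></Question>
</Questions></Assessment>"

# Hypothetical helper: text of a named child element,
# or NA when the child is absent or empty.
child_text <- function(node, name) {
  child <- node[[name]]
  if (is.null(child)) return(NA_character_)
  value <- xmlValue(child)
  if (length(value) == 0L || is.na(value) || !nzchar(value)) {
    NA_character_
  } else {
    value
  }
}

doc <- xmlParse(xml_cell, asText = TRUE)
questions <- getNodeSet(doc, "//Question")

# One row per <Question>, always with the same three columns.
answers <- rbindlist(lapply(questions, function(q) {
  data.table(Id     = child_text(q, "Id"),
             Key    = child_text(q, "Key"),
             Answer = child_text(q, "Answer"))
}))
print(answers)
```

Because every row is built with the same three named columns, rbindlist does not need fill=TRUE here, and missing or empty elements show up as NA rather than causing an error.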