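This is my first time working with XML data, and I am still trying to understand its semantics. My data is stored in a .csv file whose cells contain XML. The file has more than 100,000 rows, and I want to extract the data into a data frame.
I read the file with read.csv.
A typical cell: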
<Assessment>
<Key>Demographics</Key>
<Name>Patient Demographics</Name>
<AssessmentProvider></AssessmentProvider>
<AssessmentVersion>1.0</AssessmentVersion>
<AssessmentDate></AssessmentDate>
<Questions>
<Question>
<Id>1</Id>
<QuestionType>Label</QuestionType>
<Answer></Answer>
</Question>
<Question>
<Id>2</Id>
<Key>PHN</Key>
<QuestionText>Healthcare # (Required):</QuestionText>
<QuestionType>Text</QuestionType>
<Answer>999999999</Answer>
</Question>
<Question>
<Id>3</Id>
<Key>PHN_Type</Key>
<QuestionText>Healthcare Number Issued By (Required):</QuestionText>
<QuestionType>Combo</QuestionType>
<DataItems>
<DataItem Value="1" Code="AB">Alberta</DataItem>
<DataItem Value="2" Code="BC">British Columbia</DataItem>
<DataItem Value="3" Code="MB">Manitoba</DataItem>
<DataItem Value="4" Code="NB">New Brunswick</DataItem>
<DataItem Value="5" Code="NL">Newfoundland and Labrador</DataItem>
<DataItem Value="6" Code="NT">Northwest Territories</DataItem>
<DataItem Value="7" Code="NS">Nova Scotia</DataItem>
<DataItem Value="8" Code="NU">Nunavut</DataItem>
<DataItem Value="9" Code="ON">Ontario</DataItem>
<DataItem Value="10" Code="PE">Prince Edward Island</DataItem>
<DataItem Value="11" Code="QC">Quebec</DataItem>
<DataItem Value="12" Code="SK">Saskatchewan</DataItem>
<DataItem Value="13" Code="YT">Yukon</DataItem>
<DataItem Value="14" Code="OT">Other</DataItem>
<DataItem Value="15" Code="US">United States</DataItem>
</DataItems>
<Answer>1</Answer>
</Question>
<Question>
<Id>4</Id>
<Key>FirstName</Key>
<QuestionText>First Name (Required):</QuestionText>
<QuestionType>Text</QuestionType>
<Answer>Merkel</Answer>
</Question>
<Question>
<Id>5</Id>
<Key>LastName</Key>
<QuestionText>Last Name (Required):</QuestionText>
<QuestionType>Text</QuestionType>
<Answer>test</Answer>
</Question>
<Question>
<Id>6</Id>
<Key>DOB</Key>
<QuestionText>Date of Birth (Required):</QuestionText>
<QuestionType>DateTime</QuestionType>
<Answer>2/4/1999 12:00:00 AM</Answer>
</Question>
<Question>
<Id>6</Id>
<Key>Gender</Key>
<QuestionText>Gender:</QuestionText>
<QuestionType>YesNo</QuestionType>
<YesButtonText>Female</YesButtonText>
<NoButtonText>Male</NoButtonText>
<Answer>0</Answer>
</Question>
<Question>
<Id>7</Id>
<Key>MailingAddress</Key>
<QuestionText>Current Mailing Address:</QuestionText>
<QuestionType>Text</QuestionType>
<Answer>123 elk place</Answer>
</Question>
<Question>
<Id>8</Id>
<Key>Province</Key>
<QuestionText>Province (Required):</QuestionText>
<QuestionType>Combo</QuestionType>
<DataItems>
<DataItem Value="1" Code="AB">Alberta</DataItem>
<DataItem Value="2" Code="BC">British Columbia</DataItem>
<DataItem Value="3" Code="MB">Manitoba</DataItem>
<DataItem Value="4" Code="NB">New Brunswick</DataItem>
<DataItem Value="5" Code="NL">Newfoundland and Labrador</DataItem>
<DataItem Value="6" Code="NT">Northwest Territories</DataItem>
<DataItem Value="7" Code="NS">Nova Scotia</DataItem>
<DataItem Value="8" Code="NU">Nunavut</DataItem>
<DataItem Value="9" Code="ON">Ontario</DataItem>
<DataItem Value="10" Code="PE">Prince Edward Island</DataItem>
<DataItem Value="11" Code="QC">Quebec</DataItem>
<DataItem Value="12" Code="SK">Saskatchewan</DataItem>
<DataItem Value="13" Code="YT">Yukon</DataItem>
<DataItem Value="14" Code="OT">Other</DataItem>
<DataItem Value="15" Code="US">United States</DataItem>
</DataItems>
<Answer>1</Answer>
</Question>
<Question>
<Id>9</Id>
<Key>Postal</Key>
<QuestionText>Postal Code:</QuestionText>
<QuestionType>Masked</QuestionType>
<InputMask>L#L #L#</InputMask>
<UpperCaseAction>True</UpperCaseAction>
<Answer>T1A2B1</Answer>
</Question>
<Question>
<Id>10</Id>
<Key>Email</Key>
<QuestionText>Email Address:</QuestionText>
<QuestionType>Text</QuestionType>
<Answer></Answer>
</Question>
<Question>
<Id>11</Id>
<Key>MobilePhone</Key>
<QuestionText>Mobile Phone:</QuestionText>
<QuestionType>Masked</QuestionType>
<InputMask>###-###-####</InputMask>
<UpperCaseAction>False</UpperCaseAction>
<Answer></Answer>
</Question>
</Questions>
</Assessment>
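How do I get from here to a data frame? I am also new to the XML package. (I may edit this question later to make sure I have not posted any sensitive IDs or information online.) Thanks in advance.
I tried the following code and am still trying to figure it out.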
file <- "Z:\\Project\\PIP\\PIP.xml"
xmlfile<-xmlParse(file, useInternalNode=TRUE)
xmltop<- xmlRoot(xmlfile) #gives content of root
test <- xmlSApply(xmltop[["Worksheet"]][["Table"]], function(x) xmlSApply(x, xmlValue))
test_df <- data.frame(t(test),row.names=NULL)
I used the following code to extract a cell from the csv and put the result into a data frame. The code works:
file1 <- "Kiosk.csv"
csv<-read.csv(file1, header=FALSE, sep=",", stringsAsFactor=FALSE)
zz<-file("xml.txt", open="wt", encoding="UTF-8")
sink(zz)
cat(unlist(unclass(csv[1,1])))
sink()
file <- "xml.txt"
xmlfile<-xmlParse(file)
xmltop<-xmlRoot(xmlfile)
ns1<-xmlToDataFrame(nodes=getNodeSet(xmltop,"//Assessment/Questions/Question/Answer"))
ns2<-xmlToDataFrame(nodes=getNodeSet(xmltop,"//Assessment/Questions/Question/Key"))
ns1<-cbind(ns2, ns1)
Answer 0 (score: 2)
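This should get you started. You can process the file row by row. The code below does that and extracts the Id, Key and Answer from each row's XML. The result is a list of three-column data frames (well, data tables) that look like this: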
## Id Key Answer
## 1: 1 Disclaimer NA
## 2: 2 PHN 999999997
## 3: 3 PHN_Type 1
## 4: 4 FirstName sal
## ...
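You can combine them into one big data frame/data table, process them individually, add fields, and so on. I used data.table because rbindlist, which may need its fill=TRUE argument here, is a real time saver.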
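library(XML)
library(data.table)
library(magrittr)

# Each row of the CSV holds one XML document in its first column (V1)
dat <- read.csv("sample.csv", header=FALSE, stringsAsFactors=FALSE)

# Returns a list with one data.table per CSV row
lapply(dat$V1, function(xml_cell) {
  question_list <-
    # asText=TRUE makes sure the cell is parsed as XML content, not as a file name
    xmlParse(xml_cell, asText=TRUE) %>%
    xpathApply("//Question", xmlToList) %>%
    lapply(function(x) {
      # Empty elements such as <Answer></Answer> come back as NULL
      x[sapply(x, is.null)] <- NA
      x
    })
  # Keep only the three fields of interest and stack the questions
  rbindlist(lapply(question_list, "[", c("Id", "Key", "Answer")))
})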
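The combining step mentioned above could look roughly like the sketch below. It is an illustration under assumptions rather than part of the original answer: `parse_cell` is a hypothetical helper name, the two XML strings are made-up stand-ins for `dat$V1`, and the fields are read with XPath instead of `xmlToList` so that a question without a `Key` (such as Id 1 in the sample cell) simply gets `NA`.

```r
library(XML)
library(data.table)

# Hypothetical helper: pull Id, Key and Answer out of one XML cell.
# Missing or empty elements become NA, so every row has the same three columns.
parse_cell <- function(xml_cell) {
  doc <- xmlParse(xml_cell, asText = TRUE)
  questions <- getNodeSet(doc, "//Question")
  rows <- lapply(questions, function(q) {
    fields <- sapply(c("Id", "Key", "Answer"), function(tag) {
      # unlist() gives a zero-length result when the child is absent
      value <- unlist(xpathApply(q, paste0("./", tag), xmlValue))
      if (length(value) == 0 || !nzchar(value[1])) NA_character_ else value[1]
    })
    as.list(fields)
  })
  rbindlist(rows)
}

# Two tiny stand-in cells (made-up data) in place of the CSV's V1 column
cells <- c(
  "<Assessment><Questions><Question><Id>1</Id><QuestionType>Label</QuestionType><Answer></Answer></Question><Question><Id>2</Id><Key>PHN</Key><Answer>999999999</Answer></Question></Questions></Assessment>",
  "<Assessment><Questions><Question><Id>4</Id><Key>FirstName</Key><Answer>Merkel</Answer></Question></Questions></Assessment>"
)

per_row <- lapply(cells, parse_cell)

# idcol records which CSV row each question came from;
# fill = TRUE tolerates tables whose columns differ
combined <- rbindlist(per_row, idcol = "row", fill = TRUE)
print(combined)
```

With the real file, `cells` would be replaced by `dat$V1`. Keeping the `row` column makes it possible to reshape the result later (for example, one line per CSV row with one column per `Key`) without losing track of which assessment each answer belongs to.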