Get node values and parent attribute

Time: 2019-12-18 09:38:52

Tags: r xml xml2

I have an XML file that looks like this:

<dataDscr>
  <var ID="V335" name="question1" files="F1" dcml="0" intrvl="discrete">
    <location width="1"/>
    <labl> question 1 label </labl>
    <qstn>
      <qstnLit> question 1 literal question </qstnLit>
      <ivuInstr> question 1 interviewer instructions </ivuInstr>
    </qstn>
  </var>
  <var ID="V335" name="question2" files="F1" dcml="0" intrvl="discrete">
    <location width="1"/>
    <labl> question 2 label </labl>
    <qstn>
      <preQTxt> question 2 pre question text </preQTxt>
      <qstnLit> question 2 literal question </qstnLit>
      <ivuInstr> question 2 interviewer instructions </ivuInstr>
    </qstn>
  </var>
  <var ID="V335" name="question3" files="F1" dcml="0" intrvl="discrete">
    <location width="1"/>
    <labl> question 3 label </labl>
    <qstn>
      <preQTxt> question 3 pre question text </preQTxt>
      <qstnLit> question 3 literal question </qstnLit>
    </qstn>
  </var>
</dataDscr>

I want to collect the values of all the children of <qstn>, along with the name attribute of the parent <var> tag (e.g. "question1"). Note that the number of children <qstn> has varies. For example, question1 has only two children, <qstnLit> and <ivuInstr>, whereas question2 has all the children a <qstn> can have.

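I would like the final result to be a table with one row per <var>, holding its name attribute and one column for each child of <qstn>.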

Thanks!

1 answer:

Answer 0: (score: 1)

This should work for your case:

library(tidyverse)
library(xml2)

doc <- read_xml("data.xml")

# get all var elements
vars <- xml_find_all(doc, "//var")

# extract from each "var" element the children of the "qstn" elements,
# then take the tag names and the enclosed text and put each in a column
df_long <- do.call(rbind, lapply(vars, function(x) {
  lbl  <- xml_attr(x, "name")
  tags <- xml_find_all(x, "qstn/*")
  data.frame(name = lbl,
             col  = xml_name(tags),
             txt  = trimws(xml_text(tags)),
             stringsAsFactors = FALSE)  # keep text as character, not factor
}))

# spread the data frame to wide format, one row per name;
# id_cols is passed by name so the call does not rely on argument position
df <- df_long %>% pivot_wider(id_cols = name, names_from = col, values_from = txt)

Output:

# A tibble: 3 x 4
  name      qstnLit                     ivuInstr                            preQTxt                     
  <chr>     <chr>                       <chr>                               <chr>                       
1 question1 question 1 literal question question 1 interviewer instructions NA                          
2 question2 question 2 literal question question 2 interviewer instructions question 2 pre question text
3 question3 question 3 literal question NA                                  question 3 pre question text

Here, pivot_wider handles the differing number of columns by putting NA wherever a var element has no corresponding child.
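
The NA filling can be checked in isolation, without any XML file. The following is only a minimal sketch: the toy data frame and its shortened text values are made up for illustration and are not taken from the question's file. It assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical long-format data: question1 has no preQTxt row
toy_long <- data.frame(
  name = c("question1", "question1", "question2", "question2", "question2"),
  col  = c("qstnLit", "ivuInstr", "preQTxt", "qstnLit", "ivuInstr"),
  txt  = c("lit 1", "instr 1", "pre 2", "lit 2", "instr 2"),
  stringsAsFactors = FALSE
)

# One row per name, one column per distinct value of col
toy_wide <- pivot_wider(toy_long, id_cols = name,
                        names_from = col, values_from = txt)

# The name/col combination that never occurred is filled with NA
print(is.na(toy_wide$preQTxt[toy_wide$name == "question1"]))
```

If a value other than NA is wanted for the missing cells, pivot_wider also has a values_fill argument for that purpose.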