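Versions of this question have been asked before here and here. However, I still can't get it to work. I'm trying to parse an XML document into a data frame. The problem is that some variables are missing for some observations, so I get an error because the number of rows differs. My data looks like this: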
library("xml2")
library("dplyr")
example <- read_xml(
'
<particDesc>
<person role="participant" sameAs="#P484" xml:id="EDcon250_S1">
<age value="3">35-49</age>
<sex value="1">male</sex>
<occupation>waiter</occupation>
<langKnowledge>
<langKnown level="L1" tag="ita"/>
</langKnowledge>
</person>
<person role="participant" sameAs="#P485" xml:id="EDcon250_S7">
<age value="0">unknown</age>
<sex value="2">female</sex>
<occupation>waitress</occupation>
<langKnowledge>
<langKnown level="L1" tag="ger-AT"/>
</langKnowledge>
</person>
<person role="participant" sameAs="#P465" xml:id="EDcon250_S2">
<age value="2">25-34</age>
<sex value="2">female</sex>
<langKnowledge>
<langKnown level="L1" tag="ger-AT"/>
<langKnown level="L1" tag="eng-US"/>
</langKnowledge>
</person>
</particDesc>
')
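I use Wickham's xml2 package to read the XML. I'd prefer to stick with this package, but I'm open to using XML if that is the best (or only) way to solve this problem. Anyway, my code is as follows: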
participants <- xml_find_all(example, './/person[@role = "participant"]')
extract_participants <- function(div){
  id <- xml_attr(div, "id")
  same_as <- xml_attr(div, "sameAs")
  role <- xml_attr(div, "role")
  age <- xml_find_all(div, ".//age") %>% xml_text()
  sex <- xml_find_all(div, ".//sex") %>% xml_text()
  occupation <- xml_find_all(div, ".//occupation") %>% xml_text()
  data_frame(id, same_as, role, age, sex, occupation)
}
parts_ls <- lapply(participants, extract_participants)
participants_df <- do.call(rbind, parts_ls)
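This particular problem involves the occupation variable (the third person doesn't have one), but in my real data it could just as well be any of the other variables. As I said, I can see that this has been asked before, but I couldn't get any of the suggestions to work (probably because I didn't fully understand the solutions). Ultimately, I'd like NA to be returned when a particular node is missing (so the occupation variable for the third person would be NA).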
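To make the failure concrete, here is a minimal sketch. The two-person document and the names doc, persons, and n_hits are made up for illustration. As far as I can tell, xml_find_all() returns an empty node set when a person has no matching child, so xml_text() yields a zero-length vector instead of NA:

```r
library(xml2)

# Hypothetical two-person document: the second person has no <occupation>.
doc <- read_xml('<particDesc>
  <person role="participant"><occupation>waiter</occupation></person>
  <person role="participant"><age>25-34</age></person>
</particDesc>')

persons <- xml_find_all(doc, ".//person")

# Count the <occupation> text values found for each person.
n_hits <- vapply(seq_along(persons), function(i) {
  length(xml_text(xml_find_all(persons[[i]], ".//occupation")))
}, integer(1))

n_hits
```

A person with a count of zero contributes a column of a different length than the others, which is where building the per-person data frame goes wrong.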
Edit:
Here is an alternative version using the XML package:
library("XML")
library("magrittr")
example2 <- xmlParse(
'
<particDesc>
<person role="participant" sameAs="#P484" xml:id="EDcon250_S1">
<age value="3">35-49</age>
<sex value="1">male</sex>
<occupation>waiter</occupation>
<langKnowledge>
<langKnown level="L1" tag="ita"/>
</langKnowledge>
</person>
<person role="participant" sameAs="#P485" xml:id="EDcon250_S7">
<age value="0">unknown</age>
<sex value="2">female</sex>
<occupation>waitress</occupation>
<langKnowledge>
<langKnown level="L1" tag="ger-AT"/>
</langKnowledge>
</person>
<person role="participant" sameAs="#P465" xml:id="EDcon250_S2">
<age value="2">25-34</age>
<sex value="2">female</sex>
<langKnowledge>
<langKnown level="L1" tag="ger-AT"/>
<langKnown level="L1" tag="eng-US"/>
</langKnowledge>
</person>
</particDesc>
')
example_root <- xmlRoot(example2)
process <- function(x){
  id <- xmlGetAttr(x, "id")
  role <- xmlGetAttr(x, "role")
  age <- getNodeSet(x, ".//age") %>% xmlSApply(xmlValue)
  sex <- getNodeSet(x, ".//sex") %>% xmlSApply(xmlValue)
  #occupation <- getNodeSet(x, ".//occupation") %>% xmlSApply(xmlValue)
  data.frame(id = id,
             role = role,
             #occupation = occupation,
             age = age,
             sex = sex,
             stringsAsFactors = FALSE)
}
ls <- xpathApply(example_root, "//person", process)
df <- do.call(rbind, ls)
Just uncomment the occupation lines to see the problem.
Answer 0 (score: 1)
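I got something working, but I'm not sure it's an ideal solution (I think it's rather verbose). Anyway, here is what I have so far. Suggestions for improvement are welcome.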
library("XML")
library("magrittr")
example2 <- xmlParse(
'
<particDesc>
<person role="participant" sameAs="#P484" xml:id="EDcon250_S1">
<age value="3">35-49</age>
<sex value="1">male</sex>
<occupation>waiter</occupation>
<langKnowledge>
<langKnown level="L1" tag="ita"/>
</langKnowledge>
</person>
<person role="participant" sameAs="#P485" xml:id="EDcon250_S7">
<age value="0">unknown</age>
<sex value="2">female</sex>
<occupation>waitress</occupation>
<langKnowledge>
<langKnown level="L1" tag="ger-AT"/>
</langKnowledge>
</person>
<person role="participant" sameAs="#P465" xml:id="EDcon250_S2">
<age value="2">25-34</age>
<sex value="2">female</sex>
<langKnowledge>
<langKnown level="L1" tag="ger-AT"/>
<langKnown level="L1" tag="eng-US"/>
</langKnowledge>
</person>
</particDesc>
')
example_root <- xmlRoot(example2)
person <- getNodeSet(example_root, "//person")
id <- lapply(person, xmlGetAttr, "id") %>% unlist()
role <- lapply(person, xmlGetAttr, "role") %>% unlist()
age <- lapply(person, xpathSApply, ".//age", xmlValue) %>% unlist()
sex <- lapply(person, xpathSApply, ".//sex", xmlValue) %>% unlist()
occupation <- lapply(person, xpathSApply, ".//occupation", xmlValue)
occupation[sapply(occupation, is.list)] <- NA
occupation <- unlist(occupation)
df <- data.frame(
  id = id,
  role = role,
  age = age,
  sex = sex,
  occupation = occupation)
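The replace-empty-results-with-NA step above could probably be folded into a small helper so that it applies to any variable, not just occupation. This is only a sketch: value_or_na is a hypothetical name, the short document is made up for illustration, and the helper assumes at most one matching child per person (it takes the first match).

```r
library(XML)

# Hypothetical three-person document: the third person has no <occupation>.
doc <- xmlParse('<particDesc>
  <person role="participant"><age>35-49</age><occupation>waiter</occupation></person>
  <person role="participant"><age>unknown</age><occupation>waitress</occupation></person>
  <person role="participant"><age>25-34</age></person>
</particDesc>')

persons <- getNodeSet(doc, "//person")

# Return the text of the first node matching `path` below `node`,
# or NA when nothing matches.
value_or_na <- function(node, path) {
  hits <- getNodeSet(node, path)
  if (length(hits) == 0) NA_character_ else xmlValue(hits[[1]])
}

df_sketch <- data.frame(
  age = vapply(persons, value_or_na, character(1), path = ".//age"),
  occupation = vapply(persons, value_or_na, character(1), path = ".//occupation"),
  stringsAsFactors = FALSE)
```

Because every call returns exactly one value per person, the columns always have the same length, whichever variable happens to be missing.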
Edit:
For completeness, here is the corresponding xml2 version (abridged):
example <- read_xml(
'
<particDesc>
<person role="participant" sameAs="#P484" xml:id="EDcon250_S1">
<age value="3">35-49</age>
<sex value="1">male</sex>
<occupation>waiter</occupation>
<langKnowledge>
<langKnown level="L1" tag="ita"/>
</langKnowledge>
</person>
<person role="participant" sameAs="#P485" xml:id="EDcon250_S7">
<age value="0">unknown</age>
<sex value="2">female</sex>
<occupation>waitress</occupation>
<langKnowledge>
<langKnown level="L1" tag="ger-AT"/>
</langKnowledge>
</person>
<person role="participant" sameAs="#P465" xml:id="EDcon250_S2">
<age value="2">25-34</age>
<sex value="2">female</sex>
<langKnowledge>
<langKnown level="L1" tag="ger-AT"/>
<langKnown level="L1" tag="eng-US"/>
</langKnowledge>
</person>
</particDesc>
')
participants <- xml_find_all(example, './/person[@role = "participant"]')
id <- lapply(participants, xml_attr, "id")
occupation <- lapply(participants, xml_find_all, ".//occupation")
occupation <- lapply(occupation, xml_text)
# Replace empty results (no <occupation> node found) with NA
occupation[sapply(occupation, function(y) length(y) == 0)] <- NA
occupation <- unlist(occupation)
id <- unlist(id)
data_frame(
  id = id,
  occupation = occupation)
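A possibly shorter alternative, sketched here under the assumption that the installed xml2 provides xml_find_first(): applied to a node set, it returns one result per input node, and xml_text() gives NA where nothing matched. The short document and the name df_first are made up for illustration.

```r
library(xml2)

# Hypothetical three-person document: the third person has no <occupation>.
doc <- read_xml('<particDesc>
  <person role="participant" sameAs="#P484"><age>35-49</age><occupation>waiter</occupation></person>
  <person role="participant" sameAs="#P485"><age>unknown</age><occupation>waitress</occupation></person>
  <person role="participant" sameAs="#P465"><age>25-34</age></person>
</particDesc>')

participants <- xml_find_all(doc, './/person[@role = "participant"]')

# xml_find_first() keeps one slot per participant, so all columns line up.
df_first <- data.frame(
  same_as = xml_attr(participants, "sameAs"),
  age = xml_text(xml_find_first(participants, ".//age")),
  occupation = xml_text(xml_find_first(participants, ".//occupation")),
  stringsAsFactors = FALSE)
```

This avoids the lapply() and unlist() steps entirely, at the cost of keeping only the first match when a person has several nodes of the same name.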