我有以下代码:
library(tidyverse)
library(xml2)
xmlfile <- paste0("https://www.uniprot.org/uniprot/Q9NSI8.xml")
xml_doc <- read_xml(xmlfile)
xml_find_all(xml_doc, "//d1:reference")
#> {xml_nodeset (12)}
#> [1] <reference key="1">\n <citation type="submission" date="2000-01" d ...
#> [2] <reference key="2">\n <citation type="journal article" date="2001" ...
#> [3] <reference key="3">\n <citation type="submission" date="2002-06" d ...
#> [4] <reference key="4">\n <citation type="journal article" date="2004" ...
#> [5] <reference key="5">\n <citation type="journal article" date="2000" ...
#> [6] <reference key="6">\n <citation type="journal article" date="2004" ...
#> [7] <reference key="7">\n <citation type="journal article" date="2001" ...
#> [8] <reference key="8">\n <citation type="journal article" date="2004" ...
#> [9] <reference key="9">\n <citation type="journal article" date="2009" ...
#> [10] <reference key="10">\n <citation type="journal article" date="2011 ...
#> [11] <reference key="11">\n <citation type="journal article" date="2013 ...
#> [12] <reference key="12">\n <citation type="submission" date="2010-02" ...
我想要做的是将xml_find_all(xml_doc, "//d1:reference")
的输出转换为小标题。结果:
key type date ... title ... etc....
1 submission 2000-01 A novel gene, located on human chromosome 21q11.
2 journal article 2001
... etc ...
我该怎么做?
答案 0 :(得分:1)
1.-选择包含所需信息的节点(citation
)
lapply(nodeset, function(x) xml_children(x)[1])
2.-获取标题(它是一个值):
lapply(nodeset, function(x) c(Title = xml_text(xml_children(x)[1])))
3.-获取另一个信息(它是属性):
ll <- lapply(nodeset,
function(x) c(Title = xml_text(xml_children(x)[1]),
xml_attrs(xml_children(x)[1])[[1]]))
4.-获取data.frame:
您的xml中有两个不同的结构,一个长度为4,另一个长度为7(请参见lapply(ll, length)
)。
长度7:
df1 <- as.data.frame(do.call(rbind, ll[unlist(lapply(ll, length)) == 7]))
str(df1)
'data.frame': 9 obs. of 7 variables:
$ Title : Factor w/ 9 levels "Complete sequencing and characterization of 21,243 full-length human cDNAs.",..: 2 1 6 8 3 7 4 5 9
$ type : Factor w/ 1 level "journal article": 1 1 1 1 1 1 1 1 1
$ date : Factor w/ 6 levels "2000","2001",..: 2 3 1 3 2 3 4 5 6
$ name : Factor w/ 9 levels "Biochem. Biophys. Res. Commun.",..: 9 7 8 4 1 5 2 3 6
$ volume: Factor w/ 9 levels "10","12","14",..: 4 7 8 3 6 5 1 9 2
$ first : Factor w/ 9 levels "137","17","2121",..: 8 6 5 3 1 9 7 2 4
$ last : Factor w/ 9 levels "141","17","2127",..: 8 6 5 3 1 9 7 2 4
长度4:
df2 <- as.data.frame(do.call(rbind, ll[unlist(lapply(ll, length)) == 4]))
str(df2)
'data.frame': 3 obs. of 4 variables:
$ Title: Factor w/ 3 levels "A novel gene, located on human chromosome 21q11.",..: 1 3 2
$ type : Factor w/ 1 level "submission": 1 1 1
$ date : Factor w/ 3 levels "2000-01","2002-06",..: 1 2 3
$ db : Factor w/ 2 levels "EMBL/GenBank/DDBJ databases",..: 1 1 2
添加ID:
ll <- lapply(nodeset,
function(x) c(Title = xml_text(xml_children(x)[1]),
xml_attrs(xml_children(x)[1])[[1]],
id = try(xml_attr(xml_child(xml_children(x)[1], 3), "id"))))
df1 <- as.data.frame(do.call(rbind, ll[unlist(lapply(ll, length)) == 8]))
str(df1)
'data.frame': 9 obs. of 8 variables:
$ Title : Factor w/ 9 levels "Complete sequencing and characterization of 21,243 full-length human cDNAs.",..: 2 1 6 8 3 7 4 5 9
$ type : Factor w/ 1 level "journal article": 1 1 1 1 1 1 1 1 1
$ date : Factor w/ 6 levels "2000","2001",..: 2 3 1 3 2 3 4 5 6
$ name : Factor w/ 9 levels "Biochem. Biophys. Res. Commun.",..: 9 7 8 4 1 5 2 3 6
$ volume: Factor w/ 9 levels "10","12","14",..: 4 7 8 3 6 5 1 9 2
$ first : Factor w/ 9 levels "137","17","2121",..: 8 6 5 3 1 9 7 2 4
$ last : Factor w/ 9 levels "141","17","2127",..: 8 6 5 3 1 9 7 2 4
$ id : Factor w/ 9 levels "10830953","11536050",..: 2 4 1 6 3 5 7 8 9
答案 1 :(得分:1)
尝试使用此变通方法purrr::map_df(~as.list(.))
。就您而言,
tmp <- xml_doc %>%
xml2::xml_find_all("//d1:reference")
key <- tmp %>%
xml2::xml_attrs() %>%
purrr::map_df(~as.list(.))
ref <- tmp %>%
xml2::xml_children() %>%
xml2::xml_attrs() %>%
purrr::map_df(~as.list(.))
,然后将其与dplyr::bind_cols(key, ref)
合并。希望这会有所帮助。