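I need to identify all unique attributes in an XML file so that I can convert the XML to a data frame correctly.
The following R script performs the conversion, but only when the attributes are known in advance.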
library(rvest)
library(magrittr)
xml <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
<movies>
<movie Id="1" Name="Movie 1" IMDB="8,4" Date="2008-07-31T00:00:00.000" Views="649" />
<movie Id="2" Name="Movie 2" IMDB="3,7" Location="El Cerrito, CA" Actor="Tom Hanks" />
</movies>')
movies <- xml %>% xml_nodes("movie")
data.frame(
  Id = movies %>% xml_attr("Id"),
  Name = movies %>% xml_attr("Name"),
  IMDB = movies %>% xml_attr("IMDB"),
  Date = movies %>% xml_attr("Date"),
  Views = movies %>% xml_attr("Views"),
  Location = movies %>% xml_attr("Location"),
  Actor = movies %>% xml_attr("Actor")
)
The output looks like this:
  Id    Name IMDB                    Date Views       Location     Actor
1  1 Movie 1  8,4 2008-07-31T00:00:00.000   649           <NA>      <NA>
2  2 Movie 2  3,7                    <NA>  <NA> El Cerrito, CA Tom Hanks
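How can I get a list of all unique attributes (the real data is too large to inspect by hand)?
For this example, the desired output should look like the following list: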
[1] "Id"
[2] "Name"
[3] "IMDB"
[4] "Date"
[5] "Views"
[6] "Location"
[7] "Actor"
Answer 0 (score: 2)
Using the data:
Sample = '<?xml version="1.0" encoding="UTF-8"?>
<movies>
<movie Id="1" Name="Movie 1" IMDB="8,4" Date="2008-07-31T00:00:00.000" Views="649" />
<movie Id="2" Name="Movie 2" IMDB="3,7" Location="El Cerrito, CA" Actor="Tom Hanks" />
</movies>'
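You can get most of what you need with str_extract_all from the stringr package and a regular expression. At least the way I did it, you need to strip the spurious = signs and then use unique to remove the duplicates.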
library(stringr)
unique(sub("=", "", str_extract_all(Sample, "\\w+=")[[1]]))
[1] "version" "encoding" "Id" "Name" "IMDB" "Date" "Views"
[8] "Location" "Actor"
If you do not want the attributes from the XML header (such as "encoding"), you can first run
Sample = sub(".*(<movies.*?</movies>).*", "\\1", Sample)
to keep only the part that contains the movies.
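One caveat with the pattern above: `\\w+=` matches any word followed by `=`, so text inside an attribute value can be picked up as well. The sketch below illustrates this with a hypothetical `Url` attribute (not part of the original data) and compares it with a stricter pattern that also requires the opening double quote. The stricter pattern reduces false matches but is still a text heuristic, not a parser.

```r
library(stringr)

# Hypothetical sample: the Url value itself contains "q=".
sample_with_url <- '<movies>
<movie Id="1" Url="http://example.com/?q=1" />
</movies>'

# Loose pattern: a word followed by "=" (also matches "q=" inside the value).
loose_names <- unique(sub("=", "", str_extract_all(sample_with_url, "\\w+=")[[1]]))

# Stricter pattern: a word followed by "=" and an opening double quote.
strict_names <- unique(sub("=\"", "", str_extract_all(sample_with_url, "\\w+=\"")[[1]]))

loose_names   # includes the spurious "q"
strict_names  # only "Id" and "Url"
```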
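Answer 1 (score: 1)
Here is a general approach using the xml2 package (which is loaded along with rvest). It works and is somewhat verbose (to give step-by-step guidance), but I have not had time to optimize it. See the comments in the code for an explanation of how it works.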
library(xml2)
library(dplyr)
xml <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
<movies>
<movie Id="1" Name="Movie 1" IMDB="8,4" Date="2008-07-31T00:00:00.000" Views="649" />
<movie Id="2" Name="Movie 2" IMDB="3,7" Location="El Cerrito, CA" Actor="Tom Hanks" />
</movies>')
# find all the movie nodes; returns an xml_nodeset
movies <- xml %>% xml_find_all("movie")
# get all of the attributes and their values
# (a list with one named character vector per node)
attrs <- xml_attrs(movies)
# convert each vector to a list, bind the lists together as rows,
# and return a data frame
# based on recommendation from jstuhh
finalanswer <- bind_rows(lapply(attrs, as.list))
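The question asks specifically for the list of unique attribute names. A minimal sketch of one way to get it from the same `xml_attrs()` result, without building the data frame first (the variable name `attr_names` is only illustrative):

```r
library(xml2)

xml <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
<movies>
<movie Id="1" Name="Movie 1" IMDB="8,4" Date="2008-07-31T00:00:00.000" Views="649" />
<movie Id="2" Name="Movie 2" IMDB="3,7" Location="El Cerrito, CA" Actor="Tom Hanks" />
</movies>')

# Select every movie node, wherever it sits in the document.
movies <- xml_find_all(xml, ".//movie")

# xml_attrs() returns one named character vector per node;
# collect the names and keep each one once, in order of first appearance.
attr_names <- unique(unlist(lapply(xml_attrs(movies), names)))
attr_names
```

Because the XML is parsed rather than searched as text, the version and encoding entries from the XML declaration never show up in the result.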
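Answer 2 (score: 1)
Building on G5W's solution, here is the complete code to create the data frame (I only had to adjust the sub call to avoid extracting HTML information):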
library(xml2)
library(rvest)
library(magrittr)
library(stringr)
# 1. Read xml to "xml_document" / "xml_node"
data_xml <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
<movies>
<movie Id="1" Name="Movie 1" IMDB="8,4" Date="2008-07-31T00:00:00.000" Views="649" />
<movie Id="2" Name="Movie 2" IMDB="3,7" Location="El Cerrito, CA" Actor="Tom Hanks" />
</movies>')
# 2. Transform data to a string
data_char_all <- as.character(data_xml)
# 3. Remove 'encoding' and 'version' tag from the header
data_char_movies = sub(".*(<movies.*?</movies>).*", "\\1", data_char_all)
# 4. Extract all unique attributes
attr <- unique(sub("=\"", "", str_extract_all(data_char_movies, "\\w+=\"")[[1]]))
# 5. Create dataframe
# 5.1 Create xml_nodeset and assign all nodes
movies <- data_xml %>% xml_nodes("movie")
# 5.2 Create empty dataframe and assign values
df <- setNames(data.frame(matrix(ncol = length(attr), nrow = length(movies))), attr)
for (i in seq_along(attr)) {
  df[i] <- movies %>% xml_attr(attr[i])
}
# 6. Print result
df
  Id    Name IMDB                    Date Views       Location     Actor
1  1 Movie 1  8,4 2008-07-31T00:00:00.000   649           <NA>      <NA>
2  2 Movie 2  3,7                    <NA>  <NA> El Cerrito, CA Tom Hanks
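As an alternative to pre-allocating the data frame and filling it in a loop, the columns can be built in one pass over the attribute names. This is only a sketch, assuming the names are discovered with xml2 rather than with the regular expression; `attr_names` is an illustrative variable name.

```r
library(xml2)

data_xml <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
<movies>
<movie Id="1" Name="Movie 1" IMDB="8,4" Date="2008-07-31T00:00:00.000" Views="649" />
<movie Id="2" Name="Movie 2" IMDB="3,7" Location="El Cerrito, CA" Actor="Tom Hanks" />
</movies>')

movies <- xml_find_all(data_xml, ".//movie")

# Unique attribute names across all movie nodes.
attr_names <- unique(unlist(lapply(xml_attrs(movies), names)))

# One column per attribute; xml_attr() returns NA where a node lacks it.
df <- as.data.frame(
  lapply(setNames(attr_names, attr_names), function(a) xml_attr(movies, a)),
  stringsAsFactors = FALSE
)
df
```

All columns come back as character vectors, so values such as Views or IMDB still need to be converted explicitly if numeric types are wanted.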