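I would like to import the following XML document into data frames: http://opensource.adobe.com/Spry/data/donuts.xml
Three data frames should be created, one each for items, batters, and toppings.
(The data does not need to be in 3NF, i.e. batters can be repeated for each item listed.)
So far, using the xml2 package, I have imported the XML and converted it to a nested list with the following code: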
library(xml2)
xmlobj <- read_xml("http://opensource.adobe.com/Spry/data/donuts.xml")
ls1 <- as_list(xmlobj) #Converts XML to a nested list
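As described above, I am now looking to parse/flatten the list into three data frames.
What is the best way to achieve this? Is it through a series of loops (lapply / map), passing the objects to vectors and then loading the data frames? Or should I avoid xml2 and lists entirely, and instead use the XML package with XPath-style syntax?
I tried the following, which extracts the attributes and elements of a single item, but it fails when I try to apply the function across all items: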
#Function for pulling out item attributes from list
ItemDF <- function(myItem){
  #Gather Item data into DF including attributes
  itemFrame <- data_frame(
    id = attr(myItem$item,'id'),
    type = attr(myItem$item,'type'),
    name = unlist(myItem$item$name),
    ppu = unlist(myItem$item$ppu)
  )
  return(itemFrame)
}
#Single instance
df1 <- ItemDF(ls1$items[1])
df1
#Lapply across all items throws an error
lapply(ls1$items,ItemDF)
(Note: this dataset is a proof of concept, so I am looking for an approach that I can then apply to the other XML files I want to use.)
Answer 0 (score: 2)
library(xml2)
library( tidyverse )
xmlobj <- read_xml("http://opensource.adobe.com/Spry/data/donuts.xml")
df_items <- data.frame(
  id = xml_find_all( xmlobj, ".//item" ) %>% xml_attr( "id" ),
  type = xml_find_all( xmlobj, ".//item" ) %>% xml_attr( "type" ),
  name = xml_find_all( xmlobj, ".//item/name" ) %>% xml_text(),
  ppu = xml_find_all( xmlobj, ".//item/ppu" ) %>% xml_text(),
  stringsAsFactors = FALSE )
#     id   type       name  ppu
# 1 0001  donut       Cake 0.55
# 2 0002  donut     Raised 0.55
# 3 0003  donut Buttermilk 0.55
# 4 0004    bar        Bar 0.75
# 5 0005  twist      Twist 0.65
# 6 0006 filled     Filled 0.75
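One caveat with building each column from its own `xml_find_all()` call: the columns only line up if every item has exactly one `name` and one `ppu`. For other XML files that may not hold. A minimal sketch of a per-item variant, assuming a small inline sample in place of donuts.xml (the sample document and the name `df_items_safe` are introduced here for illustration only); it relies on `xml_find_first()` returning one result per input node, with a missing node where nothing matches:

```r
library(xml2)

# Inline sample standing in for donuts.xml (an assumption for illustration);
# the second item deliberately has no <ppu> child.
doc <- read_xml('<items>
  <item id="0001" type="donut"><name>Cake</name><ppu>0.55</ppu></item>
  <item id="0002" type="donut"><name>Raised</name></item>
</items>')

items <- xml_find_all(doc, ".//item")

df_items_safe <- data.frame(
  id   = xml_attr(items, "id"),
  type = xml_attr(items, "type"),
  # One result per item; a missing child becomes NA instead of shifting rows
  name = xml_text(xml_find_first(items, "./name")),
  ppu  = as.numeric(xml_text(xml_find_first(items, "./ppu"))),
  stringsAsFactors = FALSE
)
```

With this layout, an item that lacks a child element gets `NA` in that column rather than causing a length mismatch in `data.frame()`.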
df_batters <- xml_find_all( xmlobj, ".//item" ) %>%
  map_df(~{
    set_names(
      xml_find_all(.x, ".//batters/batter") %>% xml_attr( "id" ),
      xml_find_all(.x, ".//batters/batter") %>% xml_text()
    ) %>%
      as.list() %>%
      flatten_df() %>%
      mutate(itemID = xml_attr(.x, "id" ) )
  }) %>%
  type_convert() %>%
  gather( batter, batterID, -itemID, na.rm = TRUE) %>%
  select( batterID, batter, itemID )
# # A tibble: 10 x 3
#    batterID batter       itemID
#  *    <int> <chr>        <chr>
#  1     1001 Regular      0001
#  2     1001 Regular      0002
#  3     1001 Regular      0003
#  4     1001 Regular      0004
#  5     1001 Regular      0005
#  6     1001 Regular      0006
#  7     1002 Chocolate    0001
#  8     1002 Chocolate    0003
#  9     1003 Blueberry    0001
# 10     1003 Devil's Food 0001
df_toppings <- xml_find_all( xmlobj, ".//item" ) %>%
  map_df(~{
    set_names(
      xml_find_all(.x, ".//topping") %>% xml_attr( "id" ),
      xml_find_all(.x, ".//topping") %>% xml_text()
    ) %>%
      as.list() %>%
      flatten_df() %>%
      mutate(itemID = xml_attr(.x, "id" ) )
  }) %>%
  type_convert() %>%
  gather( topping, toppingID, -itemID, na.rm = TRUE) %>%
  select( toppingID, topping, itemID )
# # A tibble: 20 x 3
#    toppingID topping                  itemID
#  *     <int> <chr>                    <chr>
#  1      5001 None                     0001
#  2      5001 None                     0002
#  3      5002 Glazed                   0001
#  4      5002 Glazed                   0002
#  5      5002 Glazed                   0005
#  6      5002 Glazed                   0006
#  7      5005 Sugar                    0001
#  8      5005 Sugar                    0002
#  9      5005 Sugar                    0005
# 10      5007 Powdered Sugar           0001
# 11      5007 Powdered Sugar           0006
# 12      5006 Chocolate with Sprinkles 0001
# 13      5003 Chocolate                0001
# 14      5003 Chocolate                0002
# 15      5003 Chocolate                0004
# 16      5003 Chocolate                0006
# 17      5004 Maple                    0001
# 18      5004 Maple                    0002
# 19      5004 Maple                    0004
# 20      5004 Maple                    0006
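Since the question asks for an approach that carries over to other XML files, the batter and topping extractions can be folded into one helper. This is a minimal sketch using only xml2 and base R, not the answer's own code: `child_table()`, its arguments, the output column names, and the inline sample document are all hypothetical, introduced here for illustration. It assumes parents and children both carry an `id` attribute, as they do in donuts.xml.

```r
library(xml2)

# Inline sample standing in for donuts.xml (an assumption for illustration)
doc <- read_xml('<items>
  <item id="0001">
    <batters><batter id="1001">Regular</batter><batter id="1002">Chocolate</batter></batters>
    <topping id="5001">None</topping>
  </item>
  <item id="0002">
    <batters><batter id="1001">Regular</batter></batters>
    <topping id="5001">None</topping><topping id="5002">Glazed</topping>
  </item>
</items>')

# Hypothetical helper: one row per child node matched by `child_xpath`,
# tagged with the id attribute of the parent matched by `parent_xpath`.
child_table <- function(doc, parent_xpath, child_xpath) {
  parents <- xml_find_all(doc, parent_xpath)
  rows <- lapply(parents, function(p) {
    kids <- xml_find_all(p, child_xpath)
    data.frame(
      parentID = rep(xml_attr(p, "id"), length(kids)),
      childID  = xml_attr(kids, "id"),
      label    = xml_text(kids),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

df_batters2  <- child_table(doc, ".//item", ".//batters/batter")
df_toppings2 <- child_table(doc, ".//item", ".//topping")
```

Because each parent contributes its rows directly in long form, there is no need to spread the children into columns and gather them back afterwards.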
Answer 1 (score: 0)
My two cents, using the item as the key for each data part (batters only):
df_batters <- xml_find_all(xmlobj, ".//item") %>%
  map_df(~{
    bind_cols(
      itemID = xml_attr(.x, "id"),
      batterID = xml_find_all(.x, ".//batters/batter") %>% xml_attr("id"),
      batter = xml_find_all(.x, ".//batters/batter") %>% xml_text()
    )}) %>% type_convert()
#    itemID batterID batter
#    <chr>     <dbl> <chr>
#  1 0001       1001 Regular
#  2 0001       1002 Chocolate
#  3 0001       1003 Blueberry
#  4 0001       1003 Devil's Food
#  5 0002       1001 Regular
#  6 0003       1001 Regular
#  7 0003       1002 Chocolate
#  8 0004       1001 Regular
#  9 0005       1001 Regular
# 10 0006       1001 Regular
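As for the `lapply()` error in the question, a likely cause is the difference between `[` and `[[`. `ls1$items[1]` returns a one-element list that still carries the name `item`, so `myItem$item` works. `lapply()` instead passes each element itself, so `myItem` is already the item and `myItem$item` is `NULL`. A minimal sketch of a list-based version under that reading, assuming an xml2 version whose `as_list()` keeps the root node (as the question's `ls1$items` implies); `item_row()` and the inline sample are hypothetical names used only for illustration:

```r
library(xml2)

# Inline sample standing in for donuts.xml (an assumption for illustration)
doc <- read_xml('<items>
  <item id="0001" type="donut"><name>Cake</name><ppu>0.55</ppu></item>
  <item id="0002" type="donut"><name>Raised</name><ppu>0.55</ppu></item>
</items>')
ls1 <- as_list(doc)

# Hypothetical helper: takes the item node itself, not a list wrapping it
item_row <- function(item) {
  data.frame(
    id   = attr(item, "id"),
    type = attr(item, "type"),
    name = unlist(item$name),
    ppu  = unlist(item$ppu),
    stringsAsFactors = FALSE
  )
}

# unname() drops the repeated "item" names so row names stay simple
df_items_list <- do.call(rbind, lapply(unname(ls1$items), item_row))
```

The XPath-based answers avoid this pitfall altogether, since they never index into the nested list.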