I flattened a JSON file with jsonlite and ended up with a list-column that contains a key field, which I call "Clothes" in my sample data:
df <- data.frame("ID" = c(1,2,3,4))
df$Things = list(list(Clothes = c("shirt","shoe","sock"), shapes = c("circle", "square")),
list(Clothes = c("shirt","pant","jacket"), shapes = c("triangle", "circle")),
list(Clothes = c("pant","belt"), shapes = c("pentagon", "square")),
list(Clothes = c("shoe","scarf","sock"), shapes = c("circle", "pentagon")))
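My goal is to pull these values out into new binary variables that indicate whether each record contains each item of clothing. I would also like to split the clothing items into separate columns, even though the clothing lists have different lengths. As you can see, the list-column is nested: each element of Things is a list that contains a Clothes vector.
Here is what the example output would look like: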
dfOut <- mutate(df,belt = c(0,0,1,0),pant = c(0,1,1,0),shirt = c(1,1,0,0),
Clothes1 = c("shirt","shirt","pant","shoe"),
Clothes2 = c("shoe","pant","belt","scarf"),
Clothes3 = c("sock","jacket",NA,"sock"))
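I assume the solution involves dplyr::mutate(), purrr::map(), apply(), or ifelse(). I would also appreciate help with the correct terminology and concepts so that I can ask this kind of question better in the future.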
Answer 0 (score: 1)
We can do something like this to count every item of clothing that appears in the df$Things list:
library(tidyverse)
# keep only Clothes, drop shapes, and unlist for ease
df$Things <- purrr::map(df$Things, ~ .[1] %>% unlist)
# build a self-named vector of clothes types, for colnames from map_dfc()
all_clothes <- unique(unlist(df$Things)) %>% set_names(.)
# count occurrences with grepl() and convert from bool to num
# (grepl() matches substrings, so "pant" would also match "pants")
counts <- purrr::map_dfc(all_clothes, ~ as.numeric(grepl(., df$Things)))
# bolt it on
dplyr::bind_cols(df, counts)
  ID              Things shirt shoe sock pant jacket belt scarf
1  1   shirt, shoe, sock     1    1    1    0      0    0     0
2  2 shirt, pant, jacket     1    0    0    1      1    0     0
3  3          pant, belt     0    0    0    1      0    1     0
4  4   shoe, scarf, sock     0    1    1    0      0    0     1
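Because grepl() does pattern matching on the coerced list elements, an exact-match variant may be safer when item names overlap. The following is a minimal base R sketch, assuming the original nested df$Things (before it is overwritten above); the helper name has_item and the variables clothes, indicators, and exact_out are hypothetical names introduced for illustration.

```r
# Rebuild the sample data so the sketch is self-contained
df <- data.frame(ID = c(1, 2, 3, 4))
df$Things <- list(
  list(Clothes = c("shirt", "shoe", "sock"),   shapes = c("circle", "square")),
  list(Clothes = c("shirt", "pant", "jacket"), shapes = c("triangle", "circle")),
  list(Clothes = c("pant", "belt"),            shapes = c("pentagon", "square")),
  list(Clothes = c("shoe", "scarf", "sock"),   shapes = c("circle", "pentagon"))
)

# Pull out just the Clothes vector of every record
clothes <- lapply(df$Things, function(x) x$Clothes)
all_clothes <- sort(unique(unlist(clothes)))

# has_item (hypothetical helper): 1 if the item is in a record's clothes, else 0
has_item <- function(item) {
  as.numeric(vapply(clothes, function(x) item %in% x, logical(1)))
}

# One 0/1 column per item; sapply() names the columns after the items
indicators <- as.data.frame(sapply(all_clothes, has_item))
exact_out <- cbind(df["ID"], indicators)
exact_out
```

Here %in% compares whole strings, so "pant" only matches an element that is exactly "pant".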
Answer 1 (score: 0)
You can do the first part of the task with a simple double loop.
# one 0/1 indicator column per clothing item
for (n in c("shirt", "scarf", "sock", "belt", "jacket", "pant", "shoe")) {
  for (i in seq_len(nrow(df))) {
    df[[n]][i] <- ifelse(n %in% df$Things[[i]]$Clothes, 1, 0)
  }
}
df
  ID                                Things shirt scarf sock belt jacket pant shoe
1  1     shirt, shoe, sock, circle, square     1     0    1    0      0    0    1
2  2 shirt, pant, jacket, triangle, circle     1     0    0    0      1    1    0
3  3          pant, belt, pentagon, square     0     0    0    1      0    1    0
4  4   shoe, scarf, sock, circle, pentagon     0     1    1    0      0    0    1
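For the second part, you could try something like this:
library(dplyr)
# unlist() names the flattened elements Clothes1, Clothes2, ..., shapes1, ...
cl <- unlist(df$Things)
Clothes <- data.frame(Name = names(cl), Thing = cl)
for (j in 1:3) {
  assign(paste0("Clothes", j),
         as.character((Clothes %>% filter(Name == paste0("Clothes", j)))[, 2]))
}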
Clothes2
[1] "shoe" "pant" "belt" "scarf"
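However, this does not fill in NA for records with fewer items (Clothes3 ends up with only three values), so it is not exactly what you want.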
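One possible way to get the NA padding is to extend every Clothes vector to the same length before binding them together. This is a minimal base R sketch under that assumption, not part of the answer above; the names clothes, padded, wide, and padded_out are hypothetical.

```r
# Rebuild the sample data so the sketch is self-contained
df <- data.frame(ID = c(1, 2, 3, 4))
df$Things <- list(
  list(Clothes = c("shirt", "shoe", "sock"),   shapes = c("circle", "square")),
  list(Clothes = c("shirt", "pant", "jacket"), shapes = c("triangle", "circle")),
  list(Clothes = c("pant", "belt"),            shapes = c("pentagon", "square")),
  list(Clothes = c("shoe", "scarf", "sock"),   shapes = c("circle", "pentagon"))
)

clothes <- lapply(df$Things, function(x) x$Clothes)
max_len <- max(lengths(clothes))

# `length<-` extends each vector to max_len, filling the new slots with NA
padded <- lapply(clothes, `length<-`, max_len)

# Stack the padded vectors row-wise: one row per record, one column per slot
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide) <- paste0("Clothes", seq_len(max_len))

padded_out <- cbind(df["ID"], wide)
padded_out
```

The third record has only two items, so its Clothes3 value is NA.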
Answer 2 (score: 0)
To accomplish this, first create a "tidy" data frame (see http://tidyr.tidyverse.org/ for the definition of "tidy data"):
library(dplyr)
library(tidyr)
library(purrr)
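tidy_df <- df %>%
  mutate(Clothes = map(Things, "Clothes")) %>%  # pluck the Clothes vector from each record
  unnest(Clothes) %>%                           # one row per (ID, clothing item)
  select(ID, Clothes)                           # drop Things; tidyr >= 1.0 no longer drops it in unnest()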
tidy_df
#>    ID Clothes
#> 1   1   shirt
#> 2   1    shoe
#> 3   1    sock
#> 4   2   shirt
#> 5   2    pant
#> 6   2  jacket
#> 7   3    pant
#> 8   3    belt
#> 9   4    shoe
#> 10  4   scarf
#> 11  4    sock
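From there, you can use tidyr::spread() to build the indicator columns (df1) and the positional ClothesN columns (df2), and then join the two: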
df1 <- tidy_df %>%
  mutate(has_clothes = 1) %>%
  spread(Clothes, has_clothes, fill = 0)
df2 <- tidy_df %>%
  group_by(ID) %>%
  mutate(rownum = paste0("Clothes", row_number())) %>%
  spread(rownum, Clothes)
left_join(df1, df2)
#> Joining, by = "ID"
#>   ID belt jacket pant scarf shirt shoe sock Clothes1 Clothes2 Clothes3
#> 1  1    0      0    0     0     1    1    1    shirt     shoe     sock
#> 2  2    0      1    1     0     1    0    0    shirt     pant   jacket
#> 3  3    1      0    1     0     0    0    0     pant     belt     <NA>
#> 4  4    0      0    0     1     0    1    1     shoe    scarf     sock
That is, the desired output dfOut can be obtained with:
df %>%
  left_join(df1, by = "ID") %>%
  left_join(df2, by = "ID")
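In tidyr 1.0 and later, spread() is superseded by pivot_wider(). The following is a sketch of the same two-step reshape with pivot_wider(), assuming tidyr >= 1.0 is installed; the names indicators, positions, slot, and wide_out are hypothetical and not from the answer above.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Rebuild the sample data so the sketch is self-contained
df <- data.frame(ID = c(1, 2, 3, 4))
df$Things <- list(
  list(Clothes = c("shirt", "shoe", "sock"),   shapes = c("circle", "square")),
  list(Clothes = c("shirt", "pant", "jacket"), shapes = c("triangle", "circle")),
  list(Clothes = c("pant", "belt"),            shapes = c("pentagon", "square")),
  list(Clothes = c("shoe", "scarf", "sock"),   shapes = c("circle", "pentagon"))
)

# Long ("tidy") form: one row per (ID, clothing item)
tidy_df <- df %>%
  transmute(ID, Clothes = map(Things, "Clothes")) %>%
  unnest(Clothes)

# 0/1 indicator column per clothing item
indicators <- tidy_df %>%
  mutate(has_clothes = 1) %>%
  pivot_wider(names_from = Clothes, values_from = has_clothes,
              values_fill = list(has_clothes = 0))

# Positional columns Clothes1, Clothes2, ...; missing slots become NA
positions <- tidy_df %>%
  group_by(ID) %>%
  mutate(slot = paste0("Clothes", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = slot, values_from = Clothes)

wide_out <- left_join(indicators, positions, by = "ID")
wide_out
```

Unlike spread(), pivot_wider() orders the new columns by first appearance in the data rather than alphabetically, so the indicator columns come out in a different order than in the output shown above.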