R: Create new fields from the values in a data frame list-column

Time: 2017-11-04 19:20:26

Tags: r list dataframe dplyr

I flattened a JSON file with jsonlite and ended up with a list-column that contains one key field, which I call "Clothes" in my sample data:

df <- data.frame("ID" = c(1,2,3,4))
df$Things = list(list(Clothes = c("shirt","shoe","sock"), shapes = c("circle", "square")),
              list(Clothes = c("shirt","pant","jacket"), shapes = c("triangle", "circle")),
              list(Clothes = c("pant","belt"), shapes = c("pentagon", "square")),
              list(Clothes = c("shoe","scarf","sock"), shapes = c("circle", "pentagon")))

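My goal is to pull those values out as new binary variables indicating whether each record contains each item of clothing. I would also like to split the clothing items into separate columns, even though the Clothes vectors sometimes differ in length. As you can see, the list-column is nested: each element of Things is a list that in turn contains a Clothes vector.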

Here is what the sample output should look like:

dfOut <- mutate(df,belt = c(0,0,1,0),pant = c(0,1,1,0),shirt = c(1,1,0,0),
Clothes1 = c("shirt","shirt","pant","shoe"),
Clothes2 = c("shoe","pant","belt","scarf"),
Clothes3 = c("sock","jacket",NA,"sock"))

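I think the solution involves some combination of dplyr::mutate(), purrr::map(), apply(), and ifelse(). I would also really appreciate help with the correct terminology/concepts so that I can ask these kinds of questions better in the future.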

3 answers:

Answer 0: (score: 1)

We can do something like this to count every item of clothing that appears in the df$Things list:

library(tidyverse)

# keep only Clothes, drop Shapes, and unlist for ease
df$Things <- purrr::map(df$Things, ~ .[1] %>% unlist)

# build a self-named vector of clothes types, for colnames from map_dfc()
all_clothes <- unique(unlist(df$Things)) %>% set_names(.)

# count occurrences with grepl() and convert from logical to numeric
counts <- purrr::map_dfc(all_clothes, ~ as.numeric(grepl(., df$Things)))

# bolt it on
dplyr::bind_cols(df, counts)

  ID              Things shirt shoe sock pant jacket belt scarf
1  1   shirt, shoe, sock     1    1    1    0      0    0     0
2  2 shirt, pant, jacket     1    0    0    1      1    0     0
3  3          pant, belt     0    0    0    1      0    1     0
4  4   shoe, scarf, sock     0    1    1    0      0    0     1
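
One caveat: grepl() matches substrings, so a pattern such as "pant" would also match a hypothetical value like "pants". Below is a minimal base R sketch of an exact-match variant. It assumes the original df from the question (before Things is overwritten above), and the names clothes_list, flags and result are only illustrative:

```r
# Sample data from the question
df <- data.frame(ID = c(1, 2, 3, 4))
df$Things <- list(
  list(Clothes = c("shirt", "shoe", "sock"), shapes = c("circle", "square")),
  list(Clothes = c("shirt", "pant", "jacket"), shapes = c("triangle", "circle")),
  list(Clothes = c("pant", "belt"), shapes = c("pentagon", "square")),
  list(Clothes = c("shoe", "scarf", "sock"), shapes = c("circle", "pentagon"))
)

# Pull the Clothes vector out of each record's nested list
clothes_list <- lapply(df$Things, function(x) x$Clothes)
all_clothes <- unique(unlist(clothes_list))

# One row per record, one column per item; %in% tests exact equality
flags <- t(vapply(clothes_list,
                  function(x) as.numeric(all_clothes %in% x),
                  numeric(length(all_clothes))))
colnames(flags) <- all_clothes

result <- cbind(df["ID"], flags)
result
```

Because %in% compares whole strings, items whose names overlap are kept apart.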

Answer 1: (score: 0)

You can do the first part of the task with a simple double loop.

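# Loop over every clothing item and every record
for (n in c("shirt", "scarf", "sock", "belt", "jacket", "pant", "shoe")) {
  for (i in seq_len(nrow(df))) {
    # 1 if item n is among the Clothes of record i, otherwise 0
    df[[n]][i] <- ifelse(n %in% df$Things[[i]]$Clothes, 1, 0)
  }
}
df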

  ID                                Things shirt scarf sock belt jacket pant shoe
1  1     shirt, shoe, sock, circle, square     1     0    1    0      0    0    1
2  2 shirt, pant, jacket, triangle, circle     1     0    0    0      1    1    0
3  3          pant, belt, pentagon, square     0     0    0    1      0    1    0
4  4   shoe, scarf, sock, circle, pentagon     0     1    1    0      0    0    1

For the second part, you could try something like this:

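library(dplyr)

# unlist() names the elements Clothes1, Clothes2, ..., shapes1, ...
cl <- unlist(df$Things)
Clothes <- data.frame(Name = names(cl), Thing = cl)
for (j in 1:3) {
  # collect the j-th clothing item of every record into Clothes<j>
  assign(paste0("Clothes", j),
         as.character((Clothes %>% filter(Name == paste0("Clothes", j)))[, 2]))
}
Clothes2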
[1] "shoe"  "pant"  "belt"  "scarf"

But it does not produce the NA for records with fewer items, so it is not quite what you want.
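
One possible way to get the NA padding, sketched in base R. It assumes df still has the original nested Things column from the question, and the names clothes_list, max_len, padded and result are illustrative rather than taken from the question:

```r
# Sample data from the question
df <- data.frame(ID = c(1, 2, 3, 4))
df$Things <- list(
  list(Clothes = c("shirt", "shoe", "sock"), shapes = c("circle", "square")),
  list(Clothes = c("shirt", "pant", "jacket"), shapes = c("triangle", "circle")),
  list(Clothes = c("pant", "belt"), shapes = c("pentagon", "square")),
  list(Clothes = c("shoe", "scarf", "sock"), shapes = c("circle", "pentagon"))
)

clothes_list <- lapply(df$Things, function(x) x$Clothes)
max_len <- max(lengths(clothes_list))

# Assigning a longer length to a vector pads it with NA
padded <- t(vapply(clothes_list,
                   function(x) { length(x) <- max_len; x },
                   character(max_len)))
colnames(padded) <- paste0("Clothes", seq_len(max_len))

result <- cbind(df["ID"], as.data.frame(padded, stringsAsFactors = FALSE))
result
```

Each Clothes vector is stretched to the length of the longest one, so the third record ends up with NA in Clothes3.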

Answer 2: (score: 0)

To accomplish this, first create a "tidy" data frame (see http://tidyr.tidyverse.org/ for the definition of 'tidy data'):

library(dplyr)
library(tidyr)
library(purrr)

tidy_df <- df %>%
  # keep only ID and the extracted Clothes vectors, so Things is dropped
  transmute(ID, Clothes = map(Things, "Clothes")) %>%
  unnest(Clothes)
tidy_df

#>    ID Clothes
#> 1   1   shirt
#> 2   1    shoe
#> 3   1    sock
#> 4   2   shirt
#> 5   2    pant
#> 6   2  jacket
#> 7   3    pant
#> 8   3    belt
#> 9   4    shoe
#> 10  4   scarf
#> 11  4    sock

From there, you can use tidyr::spread to build the different components of the desired output:

df1 <- tidy_df %>% 
  mutate(has_clothes = 1) %>%
  spread(Clothes, has_clothes, fill = 0)

df2 <- tidy_df %>% 
  group_by(ID) %>% 
  mutate(rownum = paste0("Clothes", row_number())) %>%
  spread(rownum, Clothes)

left_join(df1, df2)

#> Joining, by = "ID"
#>   ID belt jacket pant scarf shirt shoe sock Clothes1 Clothes2 Clothes3
#> 1  1    0      0    0     0     1    1    1    shirt     shoe     sock
#> 2  2    0      1    1     0     1    0    0    shirt     pant   jacket
#> 3  3    1      0    1     0     0    0    0     pant     belt     <NA>
#> 4  4    0      0    0     1     0    1    1     shoe    scarf     sock

That is, the desired output dfOut can be obtained with:

df %>% 
  left_join(df1, by = "ID") %>%
  left_join(df2, by = "ID")
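
In current tidyr releases, spread() is superseded by pivot_wider(). Here is a rough equivalent of the whole pipeline, assuming tidyr 1.1 or later (for the scalar values_fill) and the sample df from the question. The intermediate names flags, slots and result are illustrative:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Sample data from the question
df <- data.frame(ID = c(1, 2, 3, 4))
df$Things <- list(
  list(Clothes = c("shirt", "shoe", "sock"), shapes = c("circle", "square")),
  list(Clothes = c("shirt", "pant", "jacket"), shapes = c("triangle", "circle")),
  list(Clothes = c("pant", "belt"), shapes = c("pentagon", "square")),
  list(Clothes = c("shoe", "scarf", "sock"), shapes = c("circle", "pentagon"))
)

# One row per (ID, clothing item)
tidy_df <- df %>%
  transmute(ID, Clothes = map(Things, "Clothes")) %>%
  unnest(Clothes)

# 0/1 indicator column per clothing item
flags <- tidy_df %>%
  mutate(has_clothes = 1) %>%
  pivot_wider(names_from = Clothes, values_from = has_clothes, values_fill = 0)

# Clothes1, Clothes2, ... columns; missing positions become NA
slots <- tidy_df %>%
  group_by(ID) %>%
  mutate(slot = paste0("Clothes", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = slot, values_from = Clothes)

result <- df %>%
  left_join(flags, by = "ID") %>%
  left_join(slots, by = "ID")
```

Unlike spread(), pivot_wider() keeps the indicator columns in order of first appearance rather than sorting them alphabetically.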