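Creating a table of unique-value counts in R when a single cell holds multiple values

Date: 2019-04-26 18:01:40

Tags: r count unique

I am trying to create a count table from a data table that looks like this: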

df <- data.frame("Spring" = c("skirt, pants, shirt", "tshirt"), "Summer" = 
c("shorts, skirt", "pants, shoes"), Fall = c("Scarf", "purse, pants"))

               Spring        Summer         Fall
1 skirt, pants, shirt shorts, skirt        Scarf
2              tshirt  pants, shoes purse, pants

and end up with a count table that looks like this:

output <- data.frame("Spring" = 4, "Summer" = 4, Fall = 3)

  Spring Summer Fall
1      4      4    3

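So, for each season, I want to count the unique values in that column. I am having trouble because the values within a single cell are separated by commas. I tried using length(unique()), but because of the columns it does not give me the correct number.

Thanks for your help!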

2 Answers:

Answer 0: (score: 1)

One tidyverse possibility could be:

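library(dplyr)
library(tidyr)

df %>%
 mutate_if(is.factor, as.character) %>%  # strsplit() needs character, not factor
 gather(var, val) %>%                    # long format: one row per season and cell
 mutate(val = strsplit(val, ", ")) %>%   # split each cell into its items
 unnest(val) %>%                         # one row per item
 group_by(var) %>%
 summarise(val = n_distinct(val))        # unique items per season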

  var      val
  <chr>  <int>
1 Fall       3
2 Spring     4
3 Summer     4

If you want to match the desired output exactly, you can add spread():

df %>%
 mutate_if(is.factor, as.character) %>%
 gather(var, val) %>%
 mutate(val = strsplit(val, ", ")) %>%
 unnest(val) %>%
 group_by(var) %>%
 summarise(val = n_distinct(val)) %>%
 spread(var, val)

   Fall Spring Summer
  <int>  <int>  <int>
1     3      4      4
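With tidyr 1.0 or later, the same reshape-then-count pipeline can probably be written more compactly with pivot_longer() and separate_rows(). This is only a sketch, assuming a recent tidyr and character (not factor) columns; the name `result` is introduced here for illustration:

```r
library(dplyr)
library(tidyr)

# Example data from the question, kept as character columns
df <- data.frame(
  Spring = c("skirt, pants, shirt", "tshirt"),
  Summer = c("shorts, skirt", "pants, shoes"),
  Fall   = c("Scarf", "purse, pants"),
  stringsAsFactors = FALSE
)

result <- df %>%
  pivot_longer(everything(), names_to = "var", values_to = "val") %>%  # one row per season and cell
  separate_rows(val, sep = ",\\s*") %>%                                # one row per item
  group_by(var) %>%
  summarise(val = n_distinct(val))                                     # unique items per season

result
```

separate_rows() splits and unnests in one step, so the separate strsplit() and unnest() calls are not needed in this variant.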

Or, using the basic idea from @Sonny (this requires only dplyr):

df %>%
 mutate_if(is.factor, as.character) %>%
 summarise_all(list(~ n_distinct(unlist(strsplit(., ", ")))))

  Spring Summer Fall
1      4      4    3
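For comparison, the per-column idea can also be sketched in base R with no packages at all. The helper name `count_unique` and the variable `counts` are hypothetical, and the data frame is rebuilt here with stringsAsFactors = FALSE so the example is self-contained:

```r
# Example data from the question, kept as character columns
df <- data.frame(
  Spring = c("skirt, pants, shirt", "tshirt"),
  Summer = c("shorts, skirt", "pants, shoes"),
  Fall   = c("Scarf", "purse, pants"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split every cell of a column on commas
# (plus any following whitespace) and count the distinct items
count_unique <- function(x) {
  items <- unlist(strsplit(as.character(x), ",\\s*"))
  length(unique(items))
}

# Apply the helper to each column and collect the results in a one-row data frame
counts <- as.data.frame(lapply(df, count_unique))
counts
```

Because lapply() keeps the column names and their order, the result has the Spring, Summer, Fall layout asked for in the question.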

Answer 1: (score: 1)

Using summarise_all:

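library(dplyr)

getCount <- function(x) {
  x <- as.character(x)
  # Split on a comma plus any following whitespace,
  # so that " pants" and "pants" count as the same value
  length(unique(unlist(strsplit(x, ",\\s*"))))
}

# funs() is deprecated since dplyr 0.8.0; pass the function directly
df %>%
  summarise_all(getCount)

  Spring Summer Fall
1      4      4    3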
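One thing to watch with comma-separated cells: splitting on "," alone leaves a leading space on every item after the first, so the same item can be counted twice. A small illustration on made-up input (the vector `x` is not from the question):

```r
# Made-up input in which "shirt" appears both after a comma and on its own
x <- c("pants, shirt", "shirt")

# Splitting on "," alone keeps the leading space, giving "pants", " shirt", "shirt"
raw <- unlist(strsplit(x, ","))
length(unique(raw))

# Trimming whitespace first makes " shirt" and "shirt" identical
trimmed <- trimws(raw)
length(unique(trimmed))
```

The first count comes out as 3 and the second as 2. The question's example data happens to give the same totals either way, which can hide the problem.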