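Determining the most/least frequent occurrences in subsets of rows & groups of columns in a data frame

Time: 2017-09-17 21:31:19

Tags: r dataframe dplyr

I'm trying to find the most and least frequent items within groups of rows/columns in a larger data frame. Here is some data to make this clearer: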

df <- data.frame(matrix(nrow = 8, ncol = 3))
df$X1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
df$X2 <- c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange")
df$X3 <- c("green", "yellow", NA, "blue", "red", "purple", "orange", NA)
names(df) <- c("group", "A", "B")

This is what it looks like (I have NAs in my original data, so I've included them):

  group      A      B
1     1 yellow  green
2     1  green yellow
3     1 yellow   <NA>
4     2   blue   blue
5     2   <NA>    red
6     3 orange purple
7     3   <NA> orange
8     3 orange   <NA>

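For example, in the first "group" I want to determine which colour appears most often and which appears least often. The result would look like this: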

  group      A      B   most  least
1     1 yellow  green yellow  green
2     1  green yellow yellow  green
3     1 yellow   <NA> yellow  green
4     2   blue   blue   blue    red
5     2   <NA>    red   blue    red
6     3 orange purple orange purple
7     3   <NA> orange orange purple
8     3 orange   <NA> orange purple

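I'm working in a dplyr chain on the original data, so I can group_by "group", but I'm struggling to find a way to work on a "cluster" of two columns with varying numbers of rows. I don't need to do this with dplyr, but given how useful group_by is, I thought it might be the easiest route. Also, I need the result to end up in the original data frame as new columns. Any suggestions?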

2 answers:

Answer 0: (score: 3)

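This solution uses dplyr and tidyr. The strategy is to find the "most" and "least" items per group and put them in a new data frame. After that, right_join merges it with the original data frame to produce the desired output.

Note that along the way I used slice to subset the data frame to get the most and least frequent items. This guarantees exactly one "most" and one "least" per group. However, a group may contain ties. If that happens, you may want to think about a good rule for deciding which one counts as "most" or "least".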

library(dplyr)
library(tidyr)

df2 <- df %>%
  gather(Column, Value, -group, na.rm = TRUE) %>%
  count(group, Value) %>%
  arrange(group, desc(n)) %>%
  group_by(group) %>%
  slice(c(1, n())) %>%
  mutate(Type = c("most", "least")) %>%
  select(-n) %>%
  spread(Type, Value) %>%
  right_join(df, by = "group") %>%
  select(c(colnames(df), "most", "least"))
df2
# A tibble: 8 x 5
  group      A      B   most  least
  <dbl>  <chr>  <chr>  <chr>  <chr>
1     1 yellow  green yellow  green
2     1  green yellow yellow  green
3     1 yellow   <NA> yellow  green
4     2   blue   blue   blue    red
5     2   <NA>    red   blue    red
6     3 orange purple orange purple
7     3   <NA> orange orange purple
8     3 orange   <NA> orange purple
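
The tie caveat above can be checked before deciding on a rule. The following is only a sketch, not part of the original answer: it rebuilds the question's data, counts each colour per group, and reports how many colours share the highest and the lowest count. The names `ties`, `n_most` and `n_least` are made up for illustration, and it assumes tidyr 1.0.0 or later (for pivot_longer, the newer replacement for gather) and dplyr 1.0.0 or later (for the `.groups` argument).

```r
library(dplyr)
library(tidyr)

# Same data as in the question
df <- data.frame(
  group = c(1, 1, 1, 2, 2, 3, 3, 3),
  A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
  B = c("green", "yellow", NA, "blue", "red", "purple", "orange", NA),
  stringsAsFactors = FALSE
)

# Count each colour per group, then count how many colours
# share the maximum and the minimum count within each group.
ties <- df %>%
  pivot_longer(c(A, B), names_to = "Column", values_to = "Value",
               values_drop_na = TRUE) %>%
  count(group, Value) %>%
  group_by(group) %>%
  summarise(
    n_most  = sum(n == max(n)),  # > 1 means a tie for "most"
    n_least = sum(n == min(n)),  # > 1 means a tie for "least"
    .groups = "drop"
  )
ties
```

For the sample data every group has a single "most" and a single "least", so both columns are 1 throughout. If a value above 1 shows up in real data, one possible rule is to add Value as an extra sort key in arrange(), which would break ties alphabetically.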

Answer 1: (score: 2)

Two options:

  1. Reshape to long form and aggregate with summarise (or count), then subset each group with which.max / which.min:

    library(tidyverse)
    
    # tibble() replaces the deprecated data_frame()
    df <- tibble(group = c(1, 1, 1, 2, 2, 3, 3, 3),
                 A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
                 B = c("green", "yellow", NA, "blue", "red", "purple", "orange", NA))
    
    
    df %>% 
        gather(var, color, A:B) %>% 
        drop_na(color) %>% 
        group_by(group, color) %>% 
        summarise(n = n()) %>% 
        summarise(most = color[which.max(n)], 
                  least = color[which.min(n)]) %>% 
        left_join(df, .)
    #> Joining, by = "group"
    #> # A tibble: 8 x 5
    #>   group      A      B   most  least
    #>   <dbl>  <chr>  <chr>  <chr>  <chr>
    #> 1     1 yellow  green yellow  green
    #> 2     1  green yellow yellow  green
    #> 3     1 yellow   <NA> yellow  green
    #> 4     2   blue   blue   blue    red
    #> 5     2   <NA>    red   blue    red
    #> 6     3 orange purple orange purple
    #> 7     3   <NA> orange orange purple
    #> 8     3 orange   <NA> orange purple
    
  2. Sort the table of values and subset it:

      df %>% 
          group_by(group) %>% 
          mutate(most = last(names(sort(table(c(A, B))))),
                 least = first(names(sort(table(c(A, B))))))
      #> # A tibble: 8 x 5
      #> # Groups:   group [3]
      #>   group      A      B   most  least
      #>   <dbl>  <chr>  <chr>  <chr>  <chr>
      #> 1     1 yellow  green yellow  green
      #> 2     1  green yellow yellow  green
      #> 3     1 yellow   <NA> yellow  green
      #> 4     2   blue   blue   blue    red
      #> 5     2   <NA>    red   blue    red
      #> 6     3 orange purple orange purple
      #> 7     3   <NA> orange orange purple
      #> 8     3 orange   <NA> orange purple
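
Since the question says dplyr is not required, the same idea can also be sketched in base R. This is an illustration rather than something from either answer: `freq_pick` is a hypothetical helper, and ties are resolved by whichever value comes first alphabetically, because table() sorts its names and which.max / which.min return the first match.

```r
# Same data as in the question
df <- data.frame(
  group = c(1, 1, 1, 2, 2, 3, 3, 3),
  A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
  B = c("green", "yellow", NA, "blue", "red", "purple", "orange", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: name of the most or least frequent non-NA value
freq_pick <- function(x, pick = which.max) {
  counts <- table(x)            # table() drops NA by default
  names(counts)[pick(counts)]
}

# Row indices per group, so both columns can be pooled within a group
idx <- split(seq_len(nrow(df)), df$group)

most <- vapply(idx, function(i) freq_pick(c(df$A[i], df$B[i]), which.max),
               character(1))
least <- vapply(idx, function(i) freq_pick(c(df$A[i], df$B[i]), which.min),
                character(1))

# Map the per-group results back onto the original rows as new columns
df$most  <- unname(most[as.character(df$group)])
df$least <- unname(least[as.character(df$group)])
df
```

The lookup by `as.character(df$group)` relies on split() naming its pieces after the group values, so the new columns line up with the original row order.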