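I'm trying to find the most and least frequent items within groups of rows and columns in a larger data frame. Here is some example data to make this clearer: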
df <- data.frame(matrix(nrow = 8, ncol = 3))
df$X1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
df$X2 <- c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange")
df$X3 <- c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA)
names(df) <- c("group", "A", "B")
This is what it looks like (my original data contains NAs, so I have included them):
group A B
1 1 yellow green
2 1 green yellow
3 1 yellow <NA>
4 2 blue blue
5 2 <NA> red
6 3 orange purple
7 3 <NA> orange
8 3 orange <NA>
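For example, within the first "group" I want to determine which color occurs most often and which occurs least often. The result would look like this: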
group A B most least
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple
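I'm working within a dplyr chain on my original data, so I can group_by "group", but I'm struggling to find a way to work across a "cluster" of two columns with differing numbers of rows. I don't need to use dplyr to accomplish this, but given how useful group_by is, I thought it might be the easiest route. I also need the results to stay in the original data frame as new columns. Any suggestions?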
Answer 0 (score: 3)
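A solution using dplyr and tidyr. The strategy is to find the "most" and "least" items per group and put them in a new data frame, then use right_join to merge that with the original data frame and produce the desired output.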
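Note that I used slice to subset the data frame and pick out the most and least frequent items. This guarantees exactly one "most" and one "least" for each group. However, a group may contain ties. If that happens, you may want to think about a good rule for deciding which item counts as "most" and which as "least".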
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Column, Value, -group, na.rm = TRUE) %>%
count(group, Value) %>%
arrange(group, desc(n)) %>%
group_by(group) %>%
slice(c(1, n())) %>%
mutate(Type = c("most", "least")) %>%
select(-n) %>%
spread(Type, Value) %>%
right_join(df, by = "group") %>%
select(c(colnames(df), "most", "least"))
df2
# A tibble: 8 x 5
group A B most least
<dbl> <chr> <chr> <chr> <chr>
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple
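The note on ties above can be made concrete. The following is a minimal sketch, not part of the original answer: the data frame `tied` and the alphabetical tie-break rule are assumptions made purely for illustration. It flags groups whose highest or lowest count is shared, then shows one possible deterministic rule.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: in group 1, "green" and "yellow" both occur twice
tied <- data.frame(
  group = c(1, 1, 2, 2),
  A = c("yellow", "green", "blue", NA),
  B = c("green", "yellow", "blue", "red"),
  stringsAsFactors = FALSE
)

# Same counting step as in the answer
counts <- tied %>%
  gather(Column, Value, -group, na.rm = TRUE) %>%
  count(group, Value)

# Flag groups where the top or bottom count is shared by several values
tie_flags <- counts %>%
  group_by(group) %>%
  summarise(most_tied = sum(n == max(n)) > 1,
            least_tied = sum(n == min(n)) > 1)

# One possible rule (an assumption): break ties alphabetically
most <- counts %>%
  arrange(group, desc(n), Value) %>%
  group_by(group) %>%
  slice(1) %>%
  ungroup()
```

Here `tie_flags` marks group 1 as tied and group 2 as not tied, and under the alphabetical rule `most$Value` is "green" for group 1. Any other secondary sort key could be used in `arrange` instead.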
Answer 1 (score: 2)
Two options:
Reshape to long form, aggregate with summarise (or count), and subset within each group using which.max / which.min:
library(tidyverse)
# tibble() replaces the deprecated data_frame()
df <- tibble(group = c(1, 1, 1, 2, 2, 3, 3, 3),
             A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
             B = c("green", "yellow", NA, "blue", "red", "purple", "orange", NA))
df %>%
gather(var, color, A:B) %>%
drop_na(color) %>%
group_by(group, color) %>%
summarise(n = n()) %>%
summarise(most = color[which.max(n)],
least = color[which.min(n)]) %>%
left_join(df, .)
#> Joining, by = "group"
#> # A tibble: 8 x 5
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple
Sort a table of the values and subset it:
df %>%
group_by(group) %>%
mutate(most = last(names(sort(table(c(A, B))))),
least = first(names(sort(table(c(A, B))))))
#> # A tibble: 8 x 5
#> # Groups: group [3]
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple
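The two options above agree on the example data, but they can differ when counts are tied. The base R sketch below uses a hypothetical vector `vals` (not from the original data) to show what each approach picks: `which.max` returns the first maximum, while taking the last element of an ascending `sort(table(...))` returns the last of the tied values.

```r
# Hypothetical values for one group: "green" and "yellow" are tied at 2
vals <- c("yellow", "green", "yellow", "green", "red", NA)

tab <- table(vals)   # NA is dropped by default; names are in alphabetical order
sorted <- sort(tab)  # ascending by count; tied entries keep their existing order

# Option 2 style: first and last names of the sorted table
least_sorted <- names(sorted)[1]
most_sorted <- names(sorted)[length(sorted)]

# Option 1 style: which.min / which.max return the first matching position
least_which <- names(tab)[which.min(tab)]
most_which <- names(tab)[which.max(tab)]

most_which   # "green"
most_sorted  # "yellow"
```

Both approaches give "red" as the least frequent value here, since it has no tie. If tied groups matter in the real data, it is worth choosing one convention explicitly rather than relying on either default.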