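I have a data frame with several categorical columns, each with a different number of unique values. When I group by all columns and count the rows, some groups are smaller than n, where n is, for example, 2. For example: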
> df
A B C
1 x z a1
2 x z a2
3 x z a1
4 x w a1
5 x w a2
6 y w a1
7 y u a2
8 y u a2
9 y u a1
10 y u a1
DF = df %>% group_by_at(c(1:3)) %>% count()
# A tibble: 7 x 4
# Groups: A, B, C [7]
A B C n
<chr> <chr> <chr> <int>
1 x w a1 1
2 x w a2 1
3 x z a1 2
4 x z a2 1
5 y u a1 2
6 y u a2 2
7 y w a1 1
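What is the most efficient way to find which cells produce groups smaller than n and replace their values with a single uniform value, say "other", so that the smallest group after the process has size n? I need to run this on a much larger dataset.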
Answer 0 (score: 1)
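There are several ways to approach this; for example, you could replace only the w and z values in B with "other". The simplest and fastest solution I can think of is probably data.table, but whether this approach makes sense depends on your application.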
df = structure(list(A = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("x", "y"), class = "factor"), B = structure(c(3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("u", "w", "z"), class = "factor"), C = structure(c(1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("a1", "a2"), class = "factor")), .Names = c("A", "B", "C"), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
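library(data.table)
mingroup = 2  # minimum allowed group size
# Count rows per (A, B, C) group, then overwrite all three columns where the group is too small
setDT(df)[, n := .N, by = .(A, B, C)][n < mingroup, c('A', 'B', 'C') := 'other']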
Output:
A B C n
1: x z a1 2
2: other other other 1
3: x z a1 2
4: other other other 1
5: other other other 1
6: other other other 1
7: y u a2 2
8: y u a2 2
9: y u a1 2
10: y u a1 2
Alternative:
df = structure(list(A = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("x", "y"), class = "factor"), B = structure(c(3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("u", "w", "z"), class = "factor"), C = structure(c(1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("a1", "a2"), class = "factor")), .Names = c("A", "B", "C"), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
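library(data.table)
setDT(df)  # converts in place; data.table must be loaded first
mingroup = 2
# Recode one column at a time, recounting group sizes after each pass
for (col in c('C', 'B', 'A'))
  df[, n := .N, by = .(A, B, C)][n < mingroup, (col) := 'other'][, n := NULL]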
Output:
A B C
1: x z a1
2: other other other
3: x z a1
4: x w other
5: x w other
6: other other other
7: y u a2
8: y u a2
9: y u a1
10: y u a1
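Neither approach guarantees by itself that the resulting "other" group reaches the minimum size, so it may be worth checking the result. Below is a small base R sketch of such a check; `min_group_size` is a hypothetical helper name, and the `recoded` table is simply the output above typed in by hand.

```r
# Hypothetical helper: size of the smallest non-empty group over the given columns
min_group_size <- function(data, cols) {
  # interaction() builds one combined key per row; drop = TRUE removes unused combinations
  key <- interaction(as.data.frame(data)[cols], drop = TRUE)
  min(table(key))
}

# Result of the column-by-column approach, copied from the output above
recoded <- data.frame(
  A = c("x", "other", "x", "x", "x", "other", "y", "y", "y", "y"),
  B = c("z", "other", "z", "w", "w", "other", "u", "u", "u", "u"),
  C = c("a1", "other", "a1", "other", "other", "other", "a2", "a2", "a1", "a1"),
  stringsAsFactors = FALSE
)

smallest <- min_group_size(recoded, c("A", "B", "C"))
print(smallest)
```

If the returned value is still below `mingroup`, the leftover rows would need a further decision (for example merging them into another group), which depends on the application.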