In R - find the minimal number of cells that create groups smaller than n

Asked: 2018-01-28 13:21:18

Tags: r performance group-by categorical-data

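I have a data frame with several categorical columns, each with a different number of unique entries. When I group by all columns and count, some groups have fewer than n rows, where n is, e.g., 2. For example: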

> df
   A B  C
1  x z a1
2  x z a2
3  x z a1
4  x w a1
5  x w a2
6  y w a1
7  y u a2
8  y u a2
9  y u a1
10 y u a1

library(dplyr)
DF = df %>% group_by_at(c(1:3)) %>% count()

# A tibble: 7 x 4
# Groups:   A, B, C [7]
  A     B     C         n
  <chr> <chr> <chr> <int>
1 x     w     a1        1
2 x     w     a2        1
3 x     z     a1        2
4 x     z     a2        1
5 y     u     a1        2
6 y     u     a2        2
7 y     w     a1        1

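What is the most efficient way to find which cells create the groups smaller than n and to replace their values with a uniform value, let's say "other", so that the smallest group created in the process has size n? I need to run this procedure on much larger datasets.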

1 Answer:

Answer 0 (score: 1)

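There are several ways to approach this; for example, you could replace only the w and z values in B with other. The simplest and fastest solution I can think of is probably to use data.table, but whether this approach makes sense depends on your application.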
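As an illustration of the targeted replacement mentioned above, here is a minimal sketch that collapses only the w and z levels of B and then recounts the groups. It is not part of the original answer; the data are the question's toy example rebuilt with character columns, and the names `dt` and `sizes` are chosen here for illustration.

```r
library(data.table)

# Toy data from the question, rebuilt with character columns (an assumption;
# the answer's own code uses factor columns)
dt <- data.table(
  A = c("x", "x", "x", "x", "x", "y", "y", "y", "y", "y"),
  B = c("z", "z", "z", "w", "w", "w", "u", "u", "u", "u"),
  C = c("a1", "a2", "a1", "a1", "a2", "a1", "a2", "a2", "a1", "a1")
)

# Targeted replacement: collapse only the w and z levels of B
dt[B %in% c("w", "z"), B := "other"]

# Recount the group sizes to see what the replacement achieved
sizes <- dt[, .N, by = .(A, B, C)]
print(sizes)
```

On this toy data the recount still shows one group with a single row (y, other, a1), so a hand-picked replacement like this may need a further pass before every group reaches the minimum size.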

df = structure(list(A = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("x", "y"), class = "factor"), B = structure(c(3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("u", "w", "z"), class = "factor"), C = structure(c(1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("a1", "a2"), class = "factor")), .Names = c("A", "B", "C"), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
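library(data.table)
mingroup = 2  # minimum allowed group size
# Count the rows per (A, B, C) group, then overwrite all three columns in every group below the threshold
setDT(df)[, n := .N, by = .(A, B, C)][n < mingroup, c('A', 'B', 'C') := 'other']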

Output:

      A     B     C n
 1:     x     z    a1 2
 2: other other other 1
 3:     x     z    a1 2
 4: other other other 1
 5: other other other 1
 6: other other other 1
 7:     y     u    a2 2
 8:     y     u    a2 2
 9:     y     u    a1 2
10:     y     u    a1 2
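Note that the n column in this output still holds the counts from before the replacement. To confirm that every group really has at least mingroup rows afterwards, the groups can be recounted. The following is a small sketch under the same assumptions as the answer's code, with the toy data rebuilt as character columns; `dt` and `check` are names chosen here for illustration.

```r
library(data.table)

# Toy data from the question, rebuilt with character columns (an assumption)
dt <- data.table(
  A = c("x", "x", "x", "x", "x", "y", "y", "y", "y", "y"),
  B = c("z", "z", "z", "w", "w", "w", "u", "u", "u", "u"),
  C = c("a1", "a2", "a1", "a1", "a2", "a1", "a2", "a2", "a1", "a1")
)
mingroup <- 2

# Same replacement as in the answer: overwrite all columns of small groups
dt[, n := .N, by = .(A, B, C)][n < mingroup, c("A", "B", "C") := "other"]

# Recount after the replacement; the stored n column is stale at this point
check <- dt[, .(size = .N), by = .(A, B, C)]
print(check)

# Stops with an error if any group is still below the threshold
stopifnot(all(check$size >= mingroup))
```

On the toy data the four replaced rows merge into a single (other, other, other) group of size 4, so the check passes.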

Alternative:

df = structure(list(A = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("x", "y"), class = "factor"), B = structure(c(3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("u", "w", "z"), class = "factor"), C = structure(c(1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("a1", "a2"), class = "factor")), .Names = c("A", "B", "C"), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
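library(data.table)
setDT(df)  # convert by reference; the package has to be loaded before this call
mingroup = 2
# Go through the columns one at a time, recounting the groups after each replacement
for (i in c('C', 'B', 'A'))
  df[, n := .N, by = .(A, B, C)][n < mingroup, (i) := 'other'][, n := NULL]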

Output:

        A     B     C
 1:     x     z    a1
 2: other other other
 3:     x     z    a1
 4:     x     w other
 5:     x     w other
 6: other other other
 7:     y     u    a2
 8:     y     u    a2
 9:     y     u    a1
10:     y     u    a1
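For larger data with more columns, the column-by-column loop could be wrapped in a function. The following is a minimal sketch, not part of the original answer: `collapse_small_groups`, its arguments, and the temporary column name `n_tmp` are hypothetical names chosen for illustration, and the sketch assumes character (not factor) columns.

```r
library(data.table)

# Hypothetical helper: replace values column by column until small groups merge
collapse_small_groups <- function(dt, cols, mingroup = 2, fill = "other") {
  dt <- copy(as.data.table(dt))            # work on a copy, leave the input untouched
  for (col in rev(cols)) {                 # last column first, as in the loop above
    dt[, n_tmp := .N, by = cols]           # current size of every group
    dt[n_tmp < mingroup, (col) := fill]    # overwrite this column in small groups
    dt[, n_tmp := NULL]                    # drop the helper column again
  }
  dt[]
}

# Toy data from the question, with character columns
df <- data.frame(
  A = c("x", "x", "x", "x", "x", "y", "y", "y", "y", "y"),
  B = c("z", "z", "z", "w", "w", "w", "u", "u", "u", "u"),
  C = c("a1", "a2", "a1", "a1", "a2", "a1", "a2", "a2", "a1", "a1"),
  stringsAsFactors = FALSE
)

res <- collapse_small_groups(df, c("A", "B", "C"), mingroup = 2)
print(res)
```

On the toy data this should give the same table as the alternative output above. One caveat: the pooled "other" group is not itself guaranteed to reach mingroup rows (a single leftover row would remain a group of one), so a final recount of the group sizes may be worthwhile.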