Count unique combinations of values

Asked: 2012-01-14 12:07:26

Tags: r unique combinations

My data frame looks like this:

ID | value 1 | value 2 | value 3 | value 4
1  |    M    |    D    |    F    |   A
2  |    F    |    M    |    G    |   B
3  |    M    |    D    |    F    |   A
4  |    L    |    D    |    E    |   B

I would like to get something like this:

value 1 | value 2 | value 3 | value 4 | Number of combinations
   M    |    D    |    F    |    A    |     2
   F    |    M    |    G    |    B    |     1
   L    |    D    |    E    |    B    |     1

That is, count how many times each unique combination of the columns value 1 through value 4 occurs.

4 answers:

Answer 0 (score: 12)

count from the plyr package will do the task.
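A minimal sketch of how plyr::count could be applied to the question's data. The column names value.1 to value.4 and the choice to drop ID before counting are assumptions made for illustration, not something the answer spells out:

```r
library(plyr)

# The question's example data; the column names are assumed for this sketch
df <- data.frame(
  ID      = 1:4,
  value.1 = c("M", "F", "M", "L"),
  value.2 = c("D", "M", "D", "D"),
  value.3 = c("F", "G", "F", "E"),
  value.4 = c("A", "B", "A", "B")
)

# Drop ID so that only the four value columns define a combination;
# count() returns the distinct rows plus a `freq` column with the row count
res <- count(df[, -1])
print(res)
```

The freq column plays the role of "Number of combinations" in the desired output.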

Answer 1 (score: 8)

N <- 10000

d <- data.frame(
  ID=seq(1, N),
  v1=sample(c("M","F", "M", "L"), N, replace = TRUE),
  v2=sample(c("D","M","D","D"), N, replace = TRUE),
  v3=sample(c("F","G","F","E"), N, replace = TRUE),
  v4=sample(c("A","B","A","B"), N, replace = TRUE)
)

Using data.table (fastest)

dt <- data.table::as.data.table(d)
dt[, .N, by = c('v1','v2','v3','v4')]

Using dplyr

dplyr::count_(d, vars = c('v1','v2','v3','v4'))

Using plyr

plyr::count(d, vars = c('v1','v2','v3','v4'))
plyr::ddply(d, .variables = c('v1','v2','v3','v4'), nrow)

Using aggregate (slowest)

aggregate(ID ~ ., d, FUN = length)

Benchmark

microbenchmark::microbenchmark(dt[, .N, by = c('v1','v2','v3','v4')],
                               plyr::count(d, vars = c('v1','v2','v3','v4')),
                               plyr::ddply(d, .variables = c('v1','v2','v3','v4'), nrow),
                               dplyr::count_(d, vars = c('v1','v2','v3','v4')),
                               aggregate(ID ~ ., d, FUN = length),
                               times = 1000)

Unit: microseconds
                                                         expr      min       lq      mean   median        uq        max neval cld
                     dt[, .N, by = c("v1", "v2", "v3", "v4")]  887.807 1107.543  1263.777 1174.258  1289.724   4263.156  1000   a
             plyr::count(d, vars = c("v1", "v2", "v3", "v4")) 3912.791 4270.387  5379.080 4498.053  5791.743 157146.103  1000   c
 plyr::ddply(d, .variables = c("v1", "v2", "v3", "v4"), nrow) 7737.874 8553.370 10630.849 9018.266 11126.517 187301.696  1000   d
           dplyr::count_(d, vars = c("v1", "v2", "v3", "v4")) 2126.913 2432.957  2763.499 2568.251  2789.386  12549.669  1000   b
                           aggregate(ID ~ ., d, FUN = length) 7395.440 8121.828 10546.659 8776.371 10858.263 210139.759  1000   d

It is best to simply use data.table instead of data.frame, because it is the fastest and needs no other functions or libraries to do the counting. Also note that the data.frame functions run much slower on large data sets.

Final note: feel free to update this with new methods.
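One possible update, offered as a sketch rather than as part of the benchmark: the underscore verbs such as dplyr::count_() have been deprecated since dplyr 0.7.0, and the same grouping can be written with dplyr::count() and bare column names. The four-row d below is the question's data, not the 10000-row sample, and no timing is claimed for this form:

```r
library(dplyr)

# The question's four rows, using the same column names as this answer
d <- data.frame(
  ID = 1:4,
  v1 = c("M", "F", "M", "L"),
  v2 = c("D", "M", "D", "D"),
  v3 = c("F", "G", "F", "E"),
  v4 = c("A", "B", "A", "B")
)

# count() groups by the listed columns and stores the group size in `n`
res <- count(d, v1, v2, v3, v4)
print(res)
```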

Answer 2 (score: 6)

Without plyr.

aggregate(ID ~ ., d, FUN = length)  # . means all variables in d except ID
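To make the aggregate call concrete, here is a small base R sketch on the question's four rows. The data frame d is rebuilt here with assumed column names v1 to v4, and the new column name n is an arbitrary choice for illustration:

```r
# The question's example data with short column names (assumed for this sketch)
d <- data.frame(
  ID = 1:4,
  v1 = c("M", "F", "M", "L"),
  v2 = c("D", "M", "D", "D"),
  v3 = c("F", "G", "F", "E"),
  v4 = c("A", "B", "A", "B")
)

# length() of ID within each v1..v4 group is the number of rows in that group
res <- aggregate(ID ~ ., d, FUN = length)

# The count is returned in the ID column; rename it so the meaning is clear
names(res)[names(res) == "ID"] <- "n"
print(res)
```

Note that the formula method of aggregate uses na.action = na.omit by default, so rows with an NA in any of the columns would be left out of the counts.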

Answer 3 (score: 0)

Here is a solution using plyr:

library(plyr)
d <- data.frame(
    ID=seq(1,4), v1=c("M","F", "M", "L"), 
    v2=c("D","M","D","D"), v3=c("F","G","F","E"), v4=c("A","B","A","B")
)
ddply(d,.(v1,v2,v3,v4), nrow)

I hope this is not homework...