Question

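What I want to do

I have a data frame containing several grouping factors and some other data. I want to group the rows by those factors and flag or extract all rows that belong to groups with more than one member.

Problem/Question

I was able to come up with a solution (see the example below), but it is impractical because interaction() is inefficient. Even with drop = TRUE, the running time of interaction() increases dramatically as the number of levels grows. Ultimately, I want to handle 10 to 20 factors with up to 50'000 levels on a data frame with a few hundred thousand rows.

Question 1) What is the most efficient way to solve this problem? ("Efficient" measured by execution time, memory requirements, and code readability, in that order.)

Question 2) What is wrong with interaction()?

Example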

# number of rows
nobs <- 100000
# number of levels
nlvl <- 5000

#create two factors with a decent number of levels
fc1 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
fc2 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
#package in a data.frame together with some arbitrary data
wdf <- data.frame(fc1, fc2, vals = sample.int(2, size = nobs, replace = TRUE))
#just for information: number of unique combinations of factors, i.e. groups
ngroups <- nrow(unique(wdf[,1:2]))
print(ngroups)

#granular grouping, tt has nobs elements and ngroups levels
tt <- interaction(wdf[,1:2], drop = TRUE)

#grpidx contains for each row the corresponding group (i.e. level of tt)
#observe that length(grpidx) == nobs and max(grpidx) == ngroups
grpidx <- match(tt, levels(tt))
#split into list of groups (containing row indices)
grplst <- split(seq_along(grpidx), grpidx)
#flag groups with more than one member
flg_dup <- vapply(grplst, FUN = function(x)length(x)>1, FUN.VALUE = TRUE)
#collect all row indices of groups with more than one member
dupidx <- unlist(grplst[flg_dup])
#select the corresponding rows
nonunqdf <- cbind(grpidx[dupidx], wdf[dupidx,])

Timings for the line

tt <- interaction(wdf[,1:2], drop = TRUE)

nlvl == 500: 82 ms
nlvl == 5000: 28 s
nlvl == 10000: 233 s

Answer 1

Using data.table (with the example sizes from the OP, nobs = 1e5; nlvl = 5e3)...

library(data.table)
setDT(wdf) # convert to data.table in place

system.time(
  res <- wdf[, if (.N > 1) c(g = .GRP, .SD), by=.(fc1, fc2)]
)
# 0.04 seconds

DT[i, j, by] means "filter by i, group by by, then do j".

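So in this case we are

grouping by fc1, fc2
counting the rows .N in each group
returning the group counter .GRP together with the data subset .SD if the group has more than one row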
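To make the three steps above concrete, here is a minimal sketch on a tiny table. The table toy and its values are hypothetical and exist only for illustration; the expression itself is the same one used on wdf.

```r
library(data.table)

# Hypothetical toy data: (a, x) and (b, y) occur twice, (a, y) occurs once
toy <- data.table(
  fc1  = factor(c("a", "a", "b", "a", "b")),
  fc2  = factor(c("x", "y", "y", "x", "y")),
  vals = 1:5
)

# by: group on fc1, fc2
# j:  return the group counter and the group's rows only when .N > 1;
#     groups with a single row yield NULL and drop out of the result
res_toy <- toy[, if (.N > 1) c(g = .GRP, .SD), by = .(fc1, fc2)]
print(res_toy)
```

The single-member group (a, y) is dropped, and the remaining rows come back ordered by group rather than by their original position.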

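For general coverage of the notation, see ?data.table; for the special symbols, see ?.N.

I recommend visiting the website and going through the vignettes to get started with the package.

An alternative. This way preserves the original row ordering: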

system.time(res2 <- wdf[, `:=`(g = .GRP, n = .N), by=.(fc1, fc2)][n > 1L])
# 0.06 seconds

This base R way fails:

system.time(res3 <- wdf[ave(vals, fc1, fc2, FUN = length) > 1])
# causes R to freeze while eating all my RAM... 
# probably because of too many factor combos
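If a base R route is still wanted, one possible sketch (not part of the answer above, and not timed at the OP's data sizes) is to flag rows with duplicated(), which works row by row on the key combinations that actually occur. The data frame toy_df and the names keys, in_multi and nonunq are hypothetical.

```r
# Hypothetical toy data frame with two grouping factors
toy_df <- data.frame(
  fc1  = factor(c("a", "a", "b", "a", "b")),
  fc2  = factor(c("x", "y", "y", "x", "y")),
  vals = 1:5
)

keys <- toy_df[, c("fc1", "fc2")]

# A row belongs to a multi-member group if its key combination also occurs
# in an earlier row (duplicated) or in a later row (duplicated from the end)
in_multi <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# Extract the flagged rows, keeping the original row order
nonunq <- toy_df[in_multi, ]
print(nonunq)
```

This only flags and extracts the rows; unlike the data.table versions it does not produce a group counter.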
