I have a data frame with several grouping factors and some other data. I want to group the rows by those factors and flag or extract all rows that belong to groups with more than one member.
I was able to come up with a solution (see the example below), but it is impractical because interaction() is inefficient: even with drop = TRUE, the runtime of interaction() increases dramatically as the number of levels grows. Ultimately, I want to process 10 to 20 factors, with up to 50 and 39 levels, on a data frame with a few hundred thousand rows.
Question 1) What is the most efficient way to solve this? ("Efficient" measured by execution time, memory requirements, and code readability, in that order.)
Question 2) What is wrong with interaction()?
# number of rows
nobs <- 100000
# number of levels
nlvl <- 5000
#create two factors with a decent number of levels
fc1 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
fc2 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
#package in a data.frame together with some arbitrary data
wdf <- data.frame(fc1, fc2, vals = sample.int(2, size = nobs, replace = TRUE))
#just for information: number of unique combinations of factors, i.e. groups
ngroups <- nrow(unique(wdf[,1:2]))
print(ngroups)
#granular grouping, tt has nobs elements and ngroups levels
tt <- interaction(wdf[,1:2], drop = TRUE)
#grpidx contains for each row the corresponding group (i.e. level of tt)
#observe that length(grpidx) == nobs and max(grpidx) == ngroups
grpidx <- match(tt, levels(tt))
#split into list of groups (containing row indices)
grplst <- split(seq_along(grpidx), grpidx)
#flag groups with more than one member
flg_dup <- vapply(grplst, FUN = function(x)length(x)>1, FUN.VALUE = TRUE)
#collect all row indices of groups with more than one member
dupidx <- unlist(grplst[flg_dup])
#select the corresponding rows
nonunqdf <- cbind(grpidx[dupidx], wdf[dupidx,])
The line tt <- interaction(wdf[,1:2], drop = TRUE) is what makes the whole approach slow.
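A minimal sketch (not part of the original question) for checking how the runtime of this call scales, reusing nobs from above with illustrative level counts:
# time interaction() for an increasing number of levels (illustrative sizes)
for (nlvl in c(500, 5000, 50000)) {
  f1 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
  f2 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
  print(system.time(interaction(f1, f2, drop = TRUE)))
}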
Answer (score: 2)
Using data.table (with the example size from the OP, nobs = 1e5; nlvl = 5e3)...
library(data.table)
setDT(wdf) # convert to data.table in place
system.time(
res <- wdf[, if (.N > 1) c(g = .GRP, .SD), by=.(fc1, fc2)]
)
# 0.04 seconds
DT[i, j, by] means "subset rows using i, group by by, then do j".
So in this case we are
- grouping by fc1, fc2
- counting the rows in each group, .N
- and, if there are enough rows, returning the group counter .GRP along with the group's subset of the data, .SD
For general coverage of the syntax, see ?data.table; for the special symbols, see ?.N.
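A tiny illustrative example (my addition, not from the answer) of these special symbols on a toy data.table:
library(data.table)
toy <- data.table(f = c("a", "a", "b"), v = 1:3)
# .N is the number of rows in the current group, .GRP a running group counter
toy[, .(n = .N, grp = .GRP), by = f]
# .SD is the subset of data for the current group (all columns except f here),
# so this keeps only groups with more than one row, as in the answer above
toy[, if (.N > 1) c(g = .GRP, .SD), by = f]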
I suggest visiting the website and browsing the vignettes to get started with the package.
An alternative. This way preserves the original row order:
system.time(res2 <- wdf[, `:=`(g = .GRP, n = .N), by=.(fc1, fc2)][n > 1L])
# 0.06 seconds
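Note that := adds the helper columns g and n to wdf itself by reference; if that is unwanted, they can be dropped again afterwards (a small follow-up sketch, assuming the column names used above):
# remove the helper columns from wdf again (also by reference)
wdf[, c("g", "n") := NULL]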
This base-R way fails:
system.time(res3 <- wdf[ave(vals, fc1, fc2, FUN = length) > 1])
# causes R to freeze while eating all my RAM...
# probably because of too many factor combos
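For comparison, a base-R sketch (my addition, not benchmarked at the OP's scale) that avoids building the huge interaction factor: duplicated() on the two key columns, scanned from both ends, flags every row whose (fc1, fc2) combination occurs more than once. It assumes wdf is still a plain data.frame, i.e. it is run before the setDT() call above.
# flag rows whose (fc1, fc2) combination appears more than once
key <- wdf[, c("fc1", "fc2")]
flg <- duplicated(key) | duplicated(key, fromLast = TRUE)
nonunq_base <- wdf[flg, ]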