Question

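I am trying to compute quantiles on a matrix that is stored in long format (rowid, colid, value) as a data.table. To do this, I convert it to a Matrix::sparseMatrix and then compute the quantiles column by column. Is there a more efficient way to do this? (Using R 3.2.1 and data.table 1.9.5 from GitHub.)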

require(data.table)
require(Matrix)

set.seed(100)
nobs <- 1000   #num rows in matrix
nvar <- 10    #num columns in matrix
density <- .1  #fraction of non-zero values in matrix

nrow <- round(density*nobs*nvar)
data.dt <- unique(data.table(obsid=sample(1:nobs,nrow,replace=TRUE), 
        varid=sample(1:nvar,nrow,replace=TRUE)))
data.dt <- data.dt[, value:=runif(.N)]

probs <- c(1,5,10,25,50,75,90,95,100)

#approach 1
system.time({
data.mat <- sparseMatrix(i=data.dt[,obsid], j=data.dt[,varid], x=data.dt[,value], dims=c(nobs,nvar))
quantile1.dt <- data.table(t(sapply(1:nvar, function(n) c(n,quantile(data.mat[,n], probs=probs/100, names=FALSE)))))
quantile1.dt <- setnames(quantile1.dt, c("varid",sprintf("p%02d",probs)))[order(varid)]
})

#approach 2
system.time({
quantile2.dt <- data.dt[, as.list(quantile(c(rep(0,nobs-.N), value), probs=probs/100, names=FALSE)), by=varid]
quantile2.dt <- setnames(quantile2.dt, c("varid",sprintf("p%02d",probs)))[order(varid)]
})

all.equal(quantile1.dt, quantile2.dt)

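Update: I found an answer and want to share it in case someone finds it useful. My original attempt was approach 1. A better way to compute the same result is approach 2. The real advantage of approach 2 shows up when nobs and nvar are large. For example, with nobs = 100,000 and nvar = 1,000, approach 1 takes 27 seconds while approach 2 takes 4 seconds.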
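The reason approach 2 gives the same result as approach 1 is that quantile() depends only on the sorted values of a column, so padding the non-zero values with the right number of zeros reproduces the dense column. The following is a minimal sketch of that check on toy data. The names toy.dt and n.rows and the values are made up for illustration, and a base R matrix stands in for the sparse matrix to keep the example small.

```r
library(data.table)

# Toy long-format data: a 6 x 2 matrix with three non-zero entries
# (hypothetical values, not taken from the question)
n.rows <- 6
toy.dt <- data.table(obsid = c(1L, 4L, 2L),
                     varid = c(1L, 1L, 2L),
                     value = c(0.3, 0.9, 0.5))
p <- c(0.25, 0.5, 0.75, 1)

# Dense reference: fill a zero matrix, then take quantiles of each column
dense <- matrix(0, nrow = n.rows, ncol = 2)
dense[cbind(toy.dt$obsid, toy.dt$varid)] <- toy.dt$value
dense.q <- t(apply(dense, 2, quantile, probs = p, names = FALSE))

# Long-format version: pad each group's values with its implicit zeros
long.q <- toy.dt[, as.list(quantile(c(rep(0, n.rows - .N), value),
                                    probs = p, names = FALSE)),
                 by = varid]

# Drop the varid column and compare the two results as plain matrices
long.mat <- unname(as.matrix(as.data.frame(long.q[order(varid)])[, -1]))
all.equal(dense.q, long.mat)
```

Note that a column with no non-zero entries at all never appears as a group in the long-format table, so it would be missing from the result and would need to be added separately.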

Answer 1

From your description it is a little hard for me to see what you want to do, so I will work through a basic example.

require(data.table)
require(Matrix)

set.seed(100)
nrow <- 10
ncol <- 5
prop <- 0.1
nobs <- round(prop*nrow*ncol)
s1 <- c(5,7,8,8,9) # sample(1:nrow, nobs, replace=T)
s2 <- c(1,3,3,4,4) # sample(1:ncol, nobs, replace=T)

# unique pairs
arr <- unique(array(c(s1,s2), dim=c(nobs,2)))

# random num for each unique pair
s3 <- c(0.1, 0.5, 0.8, 0.2, 0.4) # runif(length(arr[,1]))

# show data
data.frame(v1=arr[,1], v2=arr[,2], v3=s3)

#   v1 v2  v3
# 1  5  1 0.1
# 2  7  3 0.5
# 3  8  3 0.8
# 4  8  4 0.2
# 5  9  4 0.4

In this case, the sparse matrix is represented as:

sm <- sparseMatrix(i=s1, j=s2, x=s3) # since all pairs are unique here

# row 1 corresponds to s1=1, ..., row 9 corresponds to s1=9
# column 1 corresponds to s2=1, ... column 4 corresponds to s2=4
sm

# [1,] .   . .   .  
# [2,] .   . .   .
# [3,] .   . .   .  
# [4,] .   . .   .  
# [5,] 0.1 . .   .  
# [6,] .   . .   .  
# [7,] .   . 0.5 .  
# [8,] .   . 0.8 0.2  
# [9,] .   . .   0.4

The values corresponding to s2=1 are (0,0,0,0,0.1,0,0,0,0,0)', and so on. We can find the quantiles of each column as follows:

q <- c(0.25, 0.5, 0.75, 1.0) # quantiles 
data.table(t(sapply(1:4, function(n) c(n,quantile(sm[,n], q)))))

#    V1 25% 50% 75% 100%
# 1:  1   0   0   0  0.1
# 2:  2   0   0   0  0.0
# 3:  3   0   0   0  0.8
# 4:  4   0   0   0  0.4

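(Note that each column of sm has only 9 entries here, although there should be 10. Note also that if I had used 1:ncol in the sapply() call above, it would not have worked, because sm has only 4 columns. For this reason, I suspect that using sparseMatrix() for the quantiles may not always do what you want.)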
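One possible way around the dimension problem noted above, as a sketch: sparseMatrix() infers the matrix size from the largest row and column indices unless its dims argument is supplied, which is what the code in the question does. The toy inputs are repeated here so the snippet runs on its own. The names sm.inferred and sm.full are introduced only for this illustration.

```r
library(Matrix)

# Same toy inputs as in the example above
nrow <- 10
ncol <- 5
s1 <- c(5, 7, 8, 8, 9)
s2 <- c(1, 3, 3, 4, 4)
s3 <- c(0.1, 0.5, 0.8, 0.2, 0.4)
q  <- c(0.25, 0.5, 0.75, 1.0)

# Without dims, the size is inferred from max(s1) and max(s2)
sm.inferred <- sparseMatrix(i = s1, j = s2, x = s3)
dim(sm.inferred)

# With dims, the empty last row and last column are kept
sm.full <- sparseMatrix(i = s1, j = s2, x = s3, dims = c(nrow, ncol))
dim(sm.full)

# Looping over 1:ncol now works, and every column has nrow entries
full.q <- t(sapply(seq_len(ncol), function(n) quantile(sm.full[, n], q)))
full.q
```

With dims given, each column contributes all 10 values to quantile(), including the implicit zeros in row 10 and the all-zero column 5.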

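What is the fastest way to do this? Assume the variables s1, s2, s3, nrow, ncol, arr are defined as above. Suppose you want the quantiles of s3 for s2 = 1. You could do this:

tmp <- s2==1
quantile( c( s3[tmp], rep(0, nrow-sum(tmp)) ), q)

There may be a better way, but I think the rep() approach should work well for large data sets.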
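The rep() padding shown for s2 = 1 can be extended to every column with a loop. This is only a sketch under the toy values from this answer. The name col.quantiles is introduced here for illustration and is not part of the original answer.

```r
# Toy inputs repeated from the example above so the snippet is self-contained
nrow <- 10
ncol <- 5
s2 <- c(1, 3, 3, 4, 4)
s3 <- c(0.1, 0.5, 0.8, 0.2, 0.4)
q  <- c(0.25, 0.5, 0.75, 1.0)

# For each column j, take its non-zero values and pad them with zeros
# so that quantile() sees exactly nrow values
col.quantiles <- t(sapply(seq_len(ncol), function(j) {
  tmp <- s2 == j
  quantile(c(s3[tmp], rep(0, nrow - sum(tmp))), q)
}))
col.quantiles
```

Because no matrix is built, columns without any non-zero entries (columns 2 and 5 here) are handled naturally: they are padded with nrow zeros and all their quantiles are 0.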

Quantiles on a matrix in long format
