Optimized way to find a specific value in every row of a large matrix in R

Time: 2016-08-15 17:58:49

Tags: r

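I have a large sparse matrix, 1M x 10 (1 million rows and 10 columns). I want to look at the values in each row of the matrix and create a new vector based on them. Below is my code. I would like to know whether there is any way to optimize it.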

CreatenewVector <- function(TestMatrix){
    newColumn = c()
    for(i in 1:nrow(TestMatrix)){ ## Loop begins
        Value  = ifelse(1 %in% TestMatrix[i,],1,0)
        newColumn = c(newColumn,Value)
    } ##Loop ends 
    return(newColumn)
}
## Sample input: TestMatrix = matrix(c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0), byrow = T, nrow = 4)

## Sample output: (1, 1, 1, 0)
## In the input TestMatrix, each group of three values is one row; for instance, (1, 0, 0) is the first row, and so on.

1 Answer:

Answer 0 (score: 4)

Assuming you are using a plain matrix object rather than a special sparse matrix class, you should use rowSums:

rowSums(x == 1) > 0

where x is the name of your matrix. This returns a logical vector; if you prefer 1/0 to TRUE/FALSE, you can easily coerce it to numeric with as.numeric().
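As a quick check, here is a minimal sketch that applies this expression to the sample input from the question. The names has_one and result are only illustrative.

```r
# Sample input from the question: a 4 x 3 matrix filled row by row
x <- matrix(c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0), byrow = TRUE, nrow = 4)

# x == 1 is a logical matrix; rowSums counts the TRUE values in each row
has_one <- rowSums(x == 1) > 0

# Coerce TRUE/FALSE to 1/0
result <- as.numeric(has_one)
print(result)
# [1] 1 1 1 0
```

This matches the sample output given in the question.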

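To give a sense of the timings, I benchmarked first on a small matrix and then on a large one. Note that n in the code below is the total number of elements, so with ncol = 4 the two matrices have 250 and 250,000 rows: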

gregor = function(x) {as.numeric(rowSums(x == 1L) > 0L)}

# original method in question
op1 = function(x){
    newColumn = c()
    for(i in 1:nrow(x)){ ## Loop begins
        Value  = ifelse(1 %in% x[i,],1,0)
        newColumn = c(newColumn,Value)
    } ##Loop ends 
    return(newColumn)
}

# modified original:
# eliminated unnecessary ifelse
# pre-allocated result vector (no growing in a loop!)
# saved numeric conversion to the end
op2 = function(x){
    newColumn = logical(nrow(x))
    for(i in 1:nrow(x)){ ## Loop begins
        newColumn[i]  = 1L %in% x[i,]
    } ##Loop ends 
    return(as.numeric(newColumn))
}

bouncy = function(x) {
    as.numeric(apply(x, 1, function(y) any(y == 1L)))
}
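Before comparing timings, it is worth confirming that the approaches return the same vector. The following is a sketch, not part of the original benchmark: the test matrix size and the seed are assumptions chosen for illustration, and the three computations are compact restatements of gregor, op2 and bouncy.

```r
set.seed(42)  # arbitrary seed, only so the check is reproducible
x <- matrix(sample(c(0L, 1L), size = 400, replace = TRUE), ncol = 4)

# Vectorized version (same expression as gregor)
by_rowsums <- as.numeric(rowSums(x == 1L) > 0L)

# Pre-allocated loop (same logic as op2)
by_loop <- logical(nrow(x))
for (i in seq_len(nrow(x))) {
    by_loop[i] <- 1L %in% x[i, ]
}
by_loop <- as.numeric(by_loop)

# Row-wise apply (same expression as bouncy)
by_apply <- as.numeric(apply(x, 1, function(y) any(y == 1L)))

# Stops with an error if any pair of results differs
stopifnot(identical(by_rowsums, by_loop), identical(by_rowsums, by_apply))
```

If the stopifnot call passes silently, the three methods agree on this input, so the benchmark compares like with like.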

Here are the results for the small matrix (n = 1e3 elements):

library(microbenchmark)  # provides microbenchmark()

n = 1e3
x = matrix(sample(c(0L, 1L), size = n, replace = TRUE), ncol = 4)
microbenchmark(gregor(x), op1(x), op2(x), bouncy(x), times = 20)

# Unit: microseconds
#       expr      min        lq       mean    median        uq      max neval  cld
#  gregor(x)   12.164   15.7750   20.14625   20.1465   24.8980   30.410    20 a   
#     op1(x) 1224.736 1258.9465 1345.46110 1275.6715 1338.0105 2002.075    20    d
#     op2(x)  846.140  864.7655  935.46740  886.2425  951.4325 1287.075    20   c 
#  bouncy(x)  439.795  453.8595  496.96475  486.5495  508.0260  711.199    20  b   

Using rowSums is the clear winner. I dropped op1 from the next test, on the larger matrix (n = 1e6 elements):

n = 1e6
x = matrix(sample(c(0L, 1L), size = n, replace = TRUE), ncol = 4)
microbenchmark(gregor(x), op2(x), bouncy(x), times = 30)
# Unit: milliseconds
#       expr        min        lq      mean    median         uq        max neval cld
#  gregor(x)   9.371777  10.02862  12.55963  10.61343   14.13236   27.70671    30 a  
#     op2(x) 822.171523 856.68916 937.23602 881.39219 1028.26738 1183.68569    30   c
#  bouncy(x) 391.604590 412.51063 502.61117 502.02431  588.78785  656.18824    30  b 

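The relative margin is even more in favor of rowSums: by median time it is about 47 times faster than the apply version and about 83 times faster than the pre-allocated loop, up from about 24 and 44 times on the small matrix.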
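The question mentions a sparse matrix, while everything above assumes a plain matrix. If the data are stored in a sparse class from the Matrix package, a similar expression should work. This is a sketch under that assumption; it was not part of the benchmark above, and the names m and sparse_result are illustrative.

```r
library(Matrix)  # recommended package that ships with standard R installs

# Sample input from the question, stored as a sparse matrix
m <- Matrix(c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0),
            nrow = 4, byrow = TRUE, sparse = TRUE)

# m == 1 stays sparse because the zero entries compare to FALSE;
# Matrix::rowSums then counts the TRUE entries in each row
sparse_result <- as.numeric(Matrix::rowSums(m == 1) > 0)
print(sparse_result)
```

For the sample input this should give the same 1/0 vector as the dense version. Timings for sparse classes may differ from the figures above and would need their own benchmark.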