Question

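Has anyone heard of a package or function that works like R's dist {stats}, which builds a distance matrix by computing the distances between the rows of a data matrix using a specified distance measure, but that accepts a sparse matrix as input?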

My data.frame (named dataCluster) has dimensions 7000 x 10000 and is almost 99% sparse. In its regular, non-sparse form, this call never seems to finish:

h1 <- hclust( dist( dataCluster ) , method = "complete" )

A similar question with no answers: Sparse Matrix as input to Hierarchical clustering in R

Answer 1

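You want wordspace::dist.matrix.

It accepts sparse matrices from the Matrix package (which is not clear from the documentation), and it can also compute cross-distances, return either Matrix or dist objects, and more.

The default distance measure is 'cosine', so be sure to specify method = 'euclidean' if that is what you want.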
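
A minimal sketch of how this could slot into the clustering call from the question. The small random dataCluster below is a made-up stand-in for the real data, and as.dist = TRUE is used on the assumption that it yields a dist object that hclust accepts. The wordspace call is guarded so the snippet still runs where the package is not installed.

```r
library(Matrix)

set.seed(1)
# Hypothetical stand-in for dataCluster: 60 rows x 200 columns, about 95% zeros
dataCluster <- as.data.frame(as.matrix(rsparsematrix(60, 200, density = 0.05)))

# Convert the data.frame to a sparse column-compressed matrix (dgCMatrix)
m <- Matrix(as.matrix(dataCluster), sparse = TRUE)

h1 <- NULL
if (requireNamespace("wordspace", quietly = TRUE)) {
  # Row-wise Euclidean distances, requested as a dist object
  d <- wordspace::dist.matrix(m, method = "euclidean", byrow = TRUE, as.dist = TRUE)
  h1 <- hclust(d, method = "complete")
}
```

On the real 7000 x 10000 data the result is still a dense 7000 x 7000 set of distances, so only the input side benefits from sparsity.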

Answer 2

**Update:** in fact, it is easy to reproduce what qlcMatrix does:

library(Matrix)

# Cosine similarity between the columns of x (or between the columns of x and y)
sparse.cos <- function(x, y = NULL, drop = TRUE){
    if(!is.null(y)){
        if(!inherits(x, "dgCMatrix") || !inherits(y, "dgCMatrix")) stop("x and y must both be dgCMatrix")
        if(drop == TRUE) colnames(x) <- rownames(x) <- colnames(y) <- rownames(y) <- NULL
        crossprod(
            tcrossprod(
                x,
                # diagonal of 1 / (column L2 norm): rescales each column to unit length
                Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5)
            ),
            tcrossprod(
                y,
                Diagonal(x = as.vector(crossprod(y ^ 2, rep(1, nrow(y)))) ^ -0.5)
            )
        )
    } else {
        if(!inherits(x, "dgCMatrix")) stop("x must be a dgCMatrix")
        if(drop == TRUE) colnames(x) <- rownames(x) <- NULL
        crossprod(
            tcrossprod(
                x,
                Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5)
            )
        )
    }
}

I found no significant difference in performance between the sparse.cos function above and qlcMatrix::cosSparse.
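
To make the algebra in sparse.cos concrete: dividing each column by its L2 norm and then taking the cross-product gives the column-by-column cosine similarity. The sketch below redoes that on a small random matrix using only the Matrix package and checks it against a dense reference. The variable names are illustrative, and it assumes no column is entirely zero.

```r
library(Matrix)

set.seed(42)
x <- rsparsematrix(50, 8, density = 0.3)   # small sparse test matrix

# Scale every column to unit L2 norm, staying sparse throughout
col_norms <- sqrt(colSums(x ^ 2))
x_unit <- x %*% Diagonal(x = 1 / col_norms)

# Cross-product of unit-length columns = cosine similarity (8 x 8)
sim_sparse <- as.matrix(crossprod(x_unit))

# Dense reference: dot products divided by the product of the norms
xd <- as.matrix(x)
dense_norms <- sqrt(colSums(xd ^ 2))
sim_dense <- crossprod(xd) / outer(dense_norms, dense_norms)

all.equal(sim_sparse, sim_dense, check.attributes = FALSE)
```

Turning the similarity into a dissimilarity, for example with as.dist(1 - sim_sparse), gives something hclust can consume.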

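qlcMatrix::cosSparse is faster than wordspace::dist.matrix when the data is more than 50% sparse, or when the similarity is being computed along the longest edge of the input matrix (i.e. tall format).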
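
Since the comparison hinges on how sparse the input is, it can help to measure that directly. A small sketch, assuming "sparsity" means the fraction of zero entries (so it is one minus the density passed to rsparsematrix); the helper name sparsity is made up for illustration.

```r
library(Matrix)

# Fraction of entries that are zero (hypothetical helper)
sparsity <- function(m) 1 - nnzero(m) / prod(dim(m))

set.seed(123)
m <- rsparsematrix(500, 100, density = 0.1)
sparsity(m)   # close to 0.9, i.e. roughly 90% sparse
```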

Performance of wordspace::dist.matrix versus qlcMatrix::cosSparse on a wide matrix (1000 x 5000) of varying sparsity (10%, 50%, 90%, or 99% sparse), computing a 1000 x 1000 similarity:

library(Matrix)
library(qlcMatrix)
library(wordspace)
library(rbenchmark)

# M10 is 10% sparse, M99 is 99% sparse
set.seed(123)
M10 <- rsparsematrix(5000, 1000, density = 0.9)
M50 <- rsparsematrix(5000, 1000, density = 0.5)
M90 <- rsparsematrix(5000, 1000, density = 0.1)
M99 <- rsparsematrix(5000, 1000, density = 0.01)
tM10 <- t(M10)
tM50 <- t(M50)
tM90 <- t(M90)
tM99 <- t(M99)
benchmark(
 "cosSparse: 10% sparse" = cosSparse(M10),
 "cosSparse: 50% sparse" = cosSparse(M50),
 "cosSparse: 90% sparse" = cosSparse(M90),
 "cosSparse: 99% sparse" = cosSparse(M99),
 "wordspace: 10% sparse" = dist.matrix(tM10, byrow = TRUE),
 "wordspace: 50% sparse" = dist.matrix(tM50, byrow = TRUE),
 "wordspace: 90% sparse" = dist.matrix(tM90, byrow = TRUE),
 "wordspace: 99% sparse" = dist.matrix(tM99, byrow = TRUE),
 replications = 2, columns = c("test", "elapsed", "relative"))

The two functions are quite comparable, with wordspace taking a slight lead at lower sparsities but definitely not at high sparsities:

                   test elapsed relative
1 cosSparse: 10% sparse   15.83  527.667
2 cosSparse: 50% sparse    4.72  157.333
3 cosSparse: 90% sparse    0.31   10.333
4 cosSparse: 99% sparse    0.03    1.000
5 wordspace: 10% sparse   15.23  507.667
6 wordspace: 50% sparse    4.28  142.667
7 wordspace: 90% sparse    0.36   12.000
8 wordspace: 99% sparse    0.09    3.000

If we flip the calculation around to compute a 5000 x 5000 matrix, then:

benchmark(
 "cosSparse: 50% sparse" = cosSparse(tM50),
 "cosSparse: 90% sparse" = cosSparse(tM90),
 "cosSparse: 99% sparse" = cosSparse(tM99),
 "wordspace: 50% sparse" = dist.matrix(M50, byrow = TRUE),
 "wordspace: 90% sparse" = dist.matrix(M90, byrow = TRUE),
 "wordspace: 99% sparse" = dist.matrix(M99, byrow = TRUE),
 replications = 1, columns = c("test", "elapsed", "relative"))

Now the competitive advantage of cosSparse becomes very clear:

                   test elapsed relative
1 cosSparse: 50% sparse   10.58  151.143
2 cosSparse: 90% sparse    1.44   20.571
3 cosSparse: 99% sparse    0.07    1.000
4 wordspace: 50% sparse   11.41  163.000
5 wordspace: 90% sparse    2.39   34.143
6 wordspace: 99% sparse    0.64    9.143

At 50% sparsity the difference in efficiency is not large, but at 90% sparsity wordspace is 1.6x slower, and at 99% sparsity it is nearly 10x slower!

Compare this performance with a square matrix:

M50.square <- rsparsematrix(1000, 1000, density = 0.5)
tM50.square <- t(M50.square)
M90.square <- rsparsematrix(1000, 1000, density = 0.1)
tM90.square <- t(M90.square)

benchmark(
 "cosSparse: square, 50% sparse" = cosSparse(M50.square),
 "wordspace: square, 50% sparse" = dist.matrix(tM50.square, byrow = TRUE),
 "cosSparse: square, 90% sparse" = cosSparse(M90.square),
 "wordspace: square, 90% sparse" = dist.matrix(tM90.square, byrow = TRUE),
 replications = 5, columns = c("test", "elapsed", "relative"))

cosSparse is marginally faster at 50% sparsity, and almost twice as fast at 90% sparsity!

                           test elapsed relative
1 cosSparse: square, 50% sparse    2.12    9.217
3 cosSparse: square, 90% sparse    0.23    1.000
2 wordspace: square, 50% sparse    2.15    9.348
4 wordspace: square, 90% sparse    0.40    1.739

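Note that wordspace::dist.matrix has more edge-case checks than qlcMatrix::cosSparse and also supports parallelization through OpenMP. In addition, wordspace::dist.matrix supports Euclidean and Jaccard distance measures, although these are much slower. The package also has many other handy features built in.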

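That said, if you only need cosine similarity, your matrix is more than 50% sparse, and you are computing along the tall dimension, cosSparse should be the tool of choice.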
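
To tie this back to the original clustering question, here is a rough sketch of feeding cosSparse output into hclust. cosSparse compares columns, so the matrix is transposed in order to compare rows. Converting similarity to a dissimilarity as 1 - similarity is one common choice and an assumption here, not something the benchmarks prescribe. The data is random, and the call is guarded in case qlcMatrix is not installed.

```r
library(Matrix)

set.seed(7)
m <- rsparsematrix(100, 400, density = 0.02)   # made-up data, about 98% sparse

# Cosine similarity is undefined for all-zero rows, so drop them first
m <- m[rowSums(abs(m)) > 0, , drop = FALSE]

h <- NULL
if (requireNamespace("qlcMatrix", quietly = TRUE)) {
  sim <- qlcMatrix::cosSparse(t(m))    # columns of t(m) are the rows of m
  d <- as.dist(1 - as.matrix(sim))     # similarity -> dissimilarity
  h <- hclust(d, method = "complete")
}
```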
