Question

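I found an R package, Rlof, which uses multithreading to compute distance matrices, and it does a wonderful job.

However, the output of the function distmc is a vector rather than a matrix. Applying as.matrix to this "dist" object turns out to be much more expensive than the multithreaded distance computation itself.

Looking at the function's help page, there are options for printing the diagonal and the upper triangle, but I don't understand where they are supposed to be used.

Is there some way to save the time spent in as.matrix?

Reproducible example: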

as.matrix

Answer 1

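What does `dist` return?

This function always returns a vector that holds the lower triangular part (by column) of the full matrix. The diag and upper arguments only affect printing, i.e., stats:::print.dist. That function can print the vector as if it were a matrix, but it really is not one.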
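To make the "lower triangle by column" layout concrete, here is a minimal sketch in plain C++ of how a position in the packed vector maps to a matrix entry. The names `packed_index` and `packed_get` are hypothetical helpers for illustration only; they are not part of R, Rcpp, or the answer's code. Indices are 0-based, unlike in R.

```cpp
#include <cstddef>
#include <vector>

// Position (0-based) of element (i, j), with i > j, inside the packed
// column-by-column lower triangle of an n x n matrix (diagonal excluded).
// Hypothetical helper, assuming the layout described above.
std::size_t packed_index(std::size_t n, std::size_t i, std::size_t j) {
    // Columns 0..j-1 hold (n-1) + (n-2) + ... + (n-j) entries in total.
    std::size_t column_offset = j * n - j * (j + 1) / 2;
    return column_offset + (i - j - 1);
}

// Look up the distance between observations a and b in the packed vector.
double packed_get(const std::vector<double>& x, std::size_t n,
                  std::size_t a, std::size_t b) {
    if (a == b) return 0.0;                          // diagonal is not stored
    if (a < b) { std::size_t t = a; a = b; b = t; }  // swap so that a > b
    return x[packed_index(n, a, b)];
}
```

For 5 observations the packed vector has 10 entries: the first four belong to column 0 (rows 1 to 4), the next three to column 1, and so on.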

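Why is `as.matrix` bad for a "dist" object?

It is hard to handle triangular matrices efficiently, or to make them symmetric, in R core. lower.tri and upper.tri can consume a lot of memory if your matrix is large; see the question "R: Convert upper triangular part of a matrix to symmetric matrix".

Converting a "dist" object to a matrix is even worse. Look at the code of stats:::as.matrix.dist: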

function (x, ...) 
{
    size <- attr(x, "Size")
    df <- matrix(0, size, size)
    df[row(df) > col(df)] <- x
    df <- df + t(df)
    labels <- attr(x, "Labels")
    dimnames(df) <- if (is.null(labels)) 
    list(seq_len(size), seq_len(size))
    else list(labels, labels)
    df
}

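Using row, col and t is a nightmare, and the final "dimnames<-" generates yet another big temporary matrix object. When memory becomes the bottleneck, everything slows down.

But we may still need a full matrix as it is easy to use.

Awkwardly, a full matrix is easier to work with, so we do want one. Consider this example: "R - How to get row & column subscripts of matched elements from a distance matrix". Operations become tricky if we try to avoid forming the full matrix.

An Rcpp solution

I wrote an Rcpp function dist2mat (see the "dist2mat.cpp" source file at the end of this answer).

The function takes two inputs: a "dist" object x and an (integer) cache blocking factor bf. It works by first creating a matrix and filling in its lower triangular part, then copying the lower triangular part to the upper triangle to make the matrix symmetric. The second step is a typical transposition operation, and for large matrices cache blocking is worthwhile. Performance should be insensitive to the cache blocking factor as long as it is neither too small nor too large. 128 or 256 is generally a good choice.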
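The two steps described above can be sketched in plain standard C++ as follows. This is not the author's "dist2mat.cpp" and it does not use the Rcpp API; the function name `packed_to_full` and its signature are assumptions made for illustration. It assumes the packed vector has length n(n-1)/2, the result is stored column-major (as R stores matrices), and `bf` is at least 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Expand a packed lower triangle (column by column, no diagonal) into a
// full symmetric n x n matrix stored column-major.
// `bf` is the cache blocking factor (block edge length), assumed >= 1.
std::vector<double> packed_to_full(const std::vector<double>& x,
                                   std::size_t n, std::size_t bf) {
    std::vector<double> A(n * n, 0.0);   // the diagonal stays zero

    // Step 1: copy each packed column into the strict lower triangle.
    std::size_t k = 0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = j + 1; i < n; ++i)
            A[i + j * n] = x[k++];

    // Step 2: mirror lower -> upper, one bf x bf block at a time, so that
    // both the block being read and the block being written stay in cache.
    for (std::size_t jb = 0; jb < n; jb += bf) {
        std::size_t jend = std::min(jb + bf, n);
        for (std::size_t ib = jb; ib < n; ib += bf) {
            std::size_t iend = std::min(ib + bf, n);
            for (std::size_t j = jb; j < jend; ++j)
                for (std::size_t i = std::max(ib, j + 1); i < iend; ++i)
                    A[j + i * n] = A[i + j * n];   // (j, i) <- (i, j)
        }
    }
    return A;
}
```

The blocking factor only changes the order in which entries are visited, so any valid `bf` produces the same matrix; it matters for speed only when the matrix is too large to fit in cache.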

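This is my first attempt at using Rcpp. I have been a C programmer using R's conventional C interface, but I want to get familiar with Rcpp, too. Since you don't know how to write compiled code, you probably don't know how to run Rcpp functions either. You need to

install the Rcpp package (if you are on Windows, I am not sure whether Rtools is additionally required);
copy my "dist2mat.cpp" into a file under R's current working directory (you can get it from getwd() in your R session). A ".cpp" file is just a plain text file, so you can create, edit and save it with any text editor.

Now let's start the showcase.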

library(Rcpp)
sourceCpp("dist2mat.cpp")  ## this takes some time; be patient

## a simple test with `dist2mat`
set.seed(0)
x <- dist(matrix(runif(10), nrow = 5, dimnames = list(letters[1:5], NULL)))
A <- dist2mat(x, 128)  ## cache blocking factor = 128
A
#          a         b         c         d         e
#a 0.0000000 0.9401067 0.9095143 0.5618382 0.4275871
#b 0.9401067 0.0000000 0.1162289 0.3884722 0.6968296
#c 0.9095143 0.1162289 0.0000000 0.3476762 0.6220650
#d 0.5618382 0.3884722 0.3476762 0.0000000 0.3368478
#e 0.4275871 0.6968296 0.6220650 0.3368478 0.0000000

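The resulting matrix preserves the row names of the original matrix / data frame passed to dist.

You can tune the cache blocking factor on your machine. Note that the effect of cache blocking is not noticeable for small matrices. Here I tried a 10000 x 10000 one.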

## mimic a "dist" object without actually calling `dist`
n <- 10000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)

system.time(A <- dist2mat(x, 64))
#   user  system elapsed 
#  0.676   0.424   1.113 

system.time(A <- dist2mat(x, 128))
#   user  system elapsed 
#  0.532   0.140   0.672 

system.time(A <- dist2mat(x, 256))
#   user  system elapsed 
#  0.616   0.140   0.759

We can benchmark dist2mat against as.matrix. Since as.matrix is RAM-hungry, I use a small example here.

## mimic a "dist" object without actually calling `dist`
n <- 2000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)

library(bench)
bench::mark(dist2mat(x, 128), as.matrix(x), check = FALSE)
## A tibble: 2 x 14
#  expression         min   mean  median     max `itr/sec` mem_alloc  n_gc n_itr
#  <chr>         <bch:tm> <bch:> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>
#1 dist2mat(x, …   24.6ms   26ms  25.8ms  37.1ms     38.4     30.5MB     0    20
#2 as.matrix(x)   154.5ms  155ms 154.8ms 154.9ms      6.46   160.3MB     0     4
## ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
##   time <list>, gc <list>

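Note how dist2mat is faster (see "mean" and "median") and also needs less RAM (see "mem_alloc"). I have set check = FALSE to disable result checking, because dist2mat returns no "dimnames" attribute (this "dist" object carries no such information) whereas as.matrix does (it sets 1:2000 as the "dimnames"), so the two results are not exactly identical. But you can verify that both are correct.

A <- dist2mat(x, 128)
B <- as.matrix(x)
range(A - B)
#[1] 0 0

"dist2mat.cpp"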

as.matrix on a distance object is extremely slow; how to make it faster?
