Question

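I have a vector A with more than 100k elements, and I want to compute the distance between every element of the vector and every other element. I am trying to do this in R using the built-in adist function and the stringdist package. The problem is that the computation is very expensive and can run for days without finishing.

My end goal is to use the distance measure to find duplicates or near-duplicates, and then build some kind of classification model around them.

The code I am currently using is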

library(stringdist)

# declare an empty data frame and append data to it
matchedStr_vecA <- data.frame(row_index = integer(),
                              col_index = integer(),
                              vecA_i = character(),
                              vecA_j = character(),
                              dist_diff_vecA = double(),
                              stringsAsFactors = FALSE)

n <- length(vecA)
k <- 1 # (keeps track of the next row to fill in the data frame)
# Visit only the pairs with i < j (one triangle of the matrix): the diagonal
# elements are zero and the other triangle is its mirror image.
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    dist_diff_vecA <- stringdist(vecA[i], vecA[j], method = "lv")
    # a list (not c()) keeps the indices and the distance numeric
    matchedStr_vecA[k, ] <- list(i, j, vecA[i], vecA[j], dist_diff_vecA)
    k <- k + 1
  }
}

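Please help me bring the computation down from O(n^2) to O(n). I can also use Python. I have been told this can be solved with dynamic programming, but I am not sure how to implement it.

Thanks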

Answer 1

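I ran into the same problem when computing a distance matrix and solved it successfully in Python. This question discusses the key ingredient of the solution, which is making sure the work is split evenly between threads: How to split diagonal matrix into equal number of items each along one of axis?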
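To make the even split concrete, here is a minimal sketch in Python. It is not the code from the linked question; the function name `split_triangle_rows` and its parameters are made up for illustration. It assumes only the pairs with i < j are computed, so row i holds n - 1 - i pairs, and it cuts the rows into contiguous chunks that carry roughly equal numbers of pairs.

```python
def split_triangle_rows(n, parts):
    """Split rows 0..n-1 of a strict upper triangle into up to `parts`
    contiguous ranges holding roughly the same number of (i, j) pairs."""
    total = n * (n - 1) // 2          # number of pairs with i < j
    bounds = [0]
    done = 0                          # pairs covered by rows 0..i so far
    for i in range(n):
        done += n - 1 - i             # row i contributes n-1-i pairs
        # close the current chunk once its cumulative share is reached
        if len(bounds) < parts and done * parts >= total * len(bounds):
            bounds.append(i + 1)
    bounds.append(n)
    return [range(a, b) for a, b in zip(bounds, bounds[1:])]


def pairs_in(rows, n):
    """Number of (i, j) pairs with i < j handled by a range of rows."""
    return sum(n - 1 - i for i in rows)
```

Each range can then be handed to its own worker. The early ranges contain fewer rows than the late ones, because the first rows of the triangle are the longest.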

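A few points to note:

The distance between two points is usually symmetric, so you can exploit that property: compute the distance between elements i and j once, then store it or reuse it for both (i, j) and (j, i).
Unless you are happy with inexact results, you cannot optimize the algorithm below O(n^2). And since you are new to programming, I would not even consider going down that road.
You should be able to parallelize the computation using index splitting, as I suggested in the question linked above, and it is about as good an option as you will get.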
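The symmetry point can be sketched in Python as follows. This is an illustration, not the answerer's actual code: `levenshtein` is a plain two-row dynamic-programming implementation, and `near_duplicates` and `max_dist` are hypothetical names. Each unordered pair is visited once via `itertools.combinations`, and pairs whose lengths differ by more than `max_dist` are skipped, which is safe because the length difference is a lower bound on the edit distance. The number of pairs is still O(n^2).

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance computed with the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a                   # keep the inner row as short as possible
    prev = list(range(len(b) + 1))    # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def near_duplicates(strings, max_dist):
    """Return (i, j, distance) for every pair i < j within max_dist."""
    out = []
    for (i, s), (j, t) in combinations(enumerate(strings), 2):
        if abs(len(s) - len(t)) > max_dist:
            continue                  # cannot be within max_dist, skip the DP
        d = levenshtein(s, t)
        if d <= max_dist:
            out.append((i, j, d))
    return out
```

For example, `near_duplicates(["apple", "appel", "banana", "apple"], 2)` reports the three pairs among the `apple`-like strings and nothing involving `banana`.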

Distance matrix for a matrix of size 100k * 100k in R
