Question

这是一棵树。第一列是分支的标识符，其中0是主干，L是左边的第一个分支，R是右边的第一个分支。 LL是第二次分叉后最左边的分支等。变量length包含每个分支的长度。

> tree
  branch length
1      0     20
2      L     12
3     LL     19
4      R     19
5     RL     12
6    RLL     10
7    RLR     12
8     RR     17

tree = data.frame(branch = c("0","L", "LL", "R", "RL", "RLL", "RLR", "RR"), length=c(20,12,19,19,12,10,12,17))
tree$branch = as.character(tree$branch)

这是这棵树的图纸

enter image description here

这树上有两个位置

posA = tree[4,]; posA$length = 12
posB = tree[6,]; posB$length = 3

位置由分支ID和距离（变量length）给出分支的原点（编辑中的更多信息）。

我编写了以下凌乱的distance函数来计算树上任意两点之间沿树枝的最短距离。 沿着分支的最短距离可以理解为蚂蚁沿着树枝走到另一个位置到达一个位置所需的最小距离。

distance = function(tree, pos1, pos2){
    if (identical(pos1$branch, pos2$branch)){Dist=pos1$length-pos2$length;return(Dist)}
    pos1path = strsplit(pos1$branch, "")[[1]]
    if (pos1path[1]!="0") {pos1path = c("0", pos1path)}
    pos2path = strsplit(pos2$branch, "")[[1]]
    if (pos2path[1]!="0") {pos2path = c("0", pos2path)}
    loop = 1:min(length(pos1path), length(pos2path))
    loop = loop[-which(loop == 1)]

    CommonTrace="included"; for (i in loop) {
        if (pos1path[i] != pos2path[i]) {
            CommonTrace = i-1; break
            }
        }

    if(CommonTrace=="included"){
        CommonTrace = min(length(pos1path), length(pos2path))
        if (length(pos1path) > length(pos2path)) {
            longerpos = pos1; shorterpos = pos2; longerpospath = pos1path
        } else {
            longerpos = pos2; shorterpos = pos1; longerpospath = pos2path
        }
        distToNode = 0
        if ((CommonTrace+1) != length(longerpospath)){
            for (i in (CommonTrace+1):(length(longerpospath)-1)){
                distToNode = distToNode + tree$length[tree$branch == paste0(longerpospath[2:i], collapse='')]
            }   
        }
        Dist = distToNode + longerpos$length + (tree[tree$branch == shorterpos$branch,]$length-shorterpos$length)
        if (identical(shorterpos, pos1)){Dist=-Dist}
        return(Dist)
    } else { # if they are sisterbranch
        Dist=0 
        if((CommonTrace+1) != length(pos1path)){
            for (i in (CommonTrace+1):(length(pos1path)-1)){
                Dist = Dist + tree$length[tree$branch == paste0(pos1path[2:i], collapse='')]
            }   
        }
        if((CommonTrace+1) != length(pos2path)){
            for (i in (CommonTrace+1):(length(pos2path)-1)){
                Dist = Dist + tree$length[tree$branch == paste(pos2path[2:i], collapse='')]
            }
        }
        Dist = Dist + pos1$length + pos2$length
        return(Dist)
    }
}

我认为算法工作正常，但效率不高。注意距离的重要标志。只有在“姐妹分支”上找不到两个职位时，这个标志才有意义。只有当两个位置中的一个位于根部和另一个位置之间的路径中时，这个符号才有意义。

distance(tree, posA, posB) # -22

然后我就这样绕过所有感兴趣的位置：

allpositions=rbind(tree, tree)
allpositions$length = c(1,5,8,2,2,3,5,6,7,8,2,3,1,2,5,6)
mat = matrix(-1, ncol=nrow(allpositions), nrow=nrow(allpositions))
    for (i in 1:nrow(allpositions)){
       for (j in 1:nrow(allpositions)){
          posA = allpositions[i,]
          posB = allpositions[j,]
          mat[i,j] = distance(tree, posA, posB)
       }
    }

#     1   2   3   4   5   6   7   8  9  10  11  12  13  14  15  16
# 1   0 -24 -39 -21 -40 -53 -55 -44 -6 -27 -33 -22 -39 -52 -55 -44
# 2  24   0 -15   7  26  39  41  30 18  -3  -9   8  25  38  41  30
# 3  39  15   0  22  41  54  56  45 33  12   6  23  40  53  56  45
# 4  21   7  22   0 -19 -32 -34 -23 15  10  16  -1 -18 -31 -34 -23
# 5  40  26  41  19   0 -13 -15   8 34  29  35  18   1 -12 -15   8
# 6  53  39  54  32  13   0   8  21 47  42  48  31  14   1   8  21
# 7  55  41  56  34  15   8   0  23 49  44  50  33  16   7   0  23
# 8  44  30  45  23   8  21  23   0 38  33  39  22   7  20  23   0
# 9   6 -18 -33 -15 -34 -47 -49 -38  0 -21 -27 -16 -33 -46 -49 -38
# 10 27   3 -12  10  29  42  44  33 21   0  -6  11  28  41  44  33
# 11 33   9  -6  16  35  48  50  39 27   6   0  17  34  47  50  39
# 12 22   8  23   1 -18 -31 -33 -22 16  11  17   0 -17 -30 -33 -22
# 13 39  25  40  18  -1 -14 -16   7 33  28  34  17   0 -13 -16   7
# 14 52  38  53  31  12  -1   7  20 46  41  47  30  13   0   7  20
# 15 55  41  56  34  15   8   0  23 49  44  50  33  16   7   0  23
# 16 44  30  45  23   8  21  23   0 38  33  39  22   7  20  23   0

举个例子，让我们考虑对象allpositions中的第一个和第三个位置。它们之间的距离为39（和-39），因为蚂蚁需要在分支0上行走19个单位，然后在分支L上行走12个单位，最后是蚂蚁需要在分支8上走LL个单位。 19 + 12 + 8 = 39

问题是我有大约20棵非常大的树，大约有50000个位置，我想计算任意两个位置之间的距离。因此，计算的距离为20 * 50000 ^ 2。它需要永远！你能帮我改进我的代码吗？

修改

如果仍然不清楚，请告诉我

tree是对树的描述。树具有某个length的分支。分支的名称（变量：branch）表示分支之间的关系。分支RL是两个分支RLL和RLR的“父分支”，其中R和L代表左右。

allpositions是一个data.frame，其中每一行代表树上的一个独立位置。你可以想到一只松鼠的位置。该职位由两个信息定义。 1）松鼠站立的分支（变量：branch）以及分支开始与松鼠位置之间的距离（变量：length）。

三个例子

考虑在分支length（长度为12）处的第一只松鼠（变量：RL）8和位于第二只松鼠（变量：{{1} }）2分支length或RLL。两只松鼠之间的距离是12 - 8 + 2 = 6（或-6）。

考虑分支RLR上的第一只松鼠（变量：length）8和分支上的第二只松鼠（变量：RL）2 length。两只松鼠之间的距离是8 + 2 = 10（或-10）。

考虑在分支RR（长度为19）处的第一只松鼠（变量：length）8和位于第二只松鼠（变量：{{1}分支R上的2）。知道该分支length的长度为12，两只松鼠之间的距离为19 - 8 + 12 + 2 = 25（或-25）。

Answer 1

下面的代码使用igraph包计算tree中位置之间的距离，并且明显快于您在问题中发布的代码。方法是在分支交叉点和allpositions中指定的位置沿树枝的位置创建图顶点。图形边缘是这些顶点之间的分支段。它使用igraph为树构建图形并allpositions，然后查找与allposition数据对应的顶点之间的距离。

t.graph <- function(tree, positions) {
  library(igraph)
  #  Assign vertex name to tree branch intersections
  n_label <- nchar(tree$branch)
  tree$high_vert <- tree$branch
  tree$low_vert <- tree$branch
  tree$brnch_type <- "tree"
  for( i in 1:nrow(tree) ) {
    tree$low_vert[i] <- if(n_label[i] > 1) substr(tree$branch[i], 1, n_label[i]-1)
    else { if(tree$branch[i] %in% c("R","L")) "0"
           else "root" }
  }
  #  combine position data with tree data    
  positions$brnch_type <- "position"
  temp <- merge(positions, tree, by = "branch")
  positions <- temp[, c("branch","length.x","high_vert","low_vert","brnch_type.x")]
  positions$high_vert <- paste(positions$branch, positions$length.x, sep="_")
  colnames(positions) <- c("branch","length","high_vert","low_vert","brnch_type")
  tree <- rbind(tree, positions)
  #   use positions to segment tree branches    
  tree_brnch <- split(tree, tree$branch)
  tree <- data.frame( branch=NA_character_, length = NA_real_, high_vert = NA_character_, 
                      low_vert = NA_character_, brnch_type =NA_character_, seg_len= NA_real_)
  for( ib in 1: length(tree_brnch)) {
    brnch_seg <- tree_brnch[[ib]][order(tree_brnch[[ib]]$length, decreasing=TRUE), ]
    n_seg <- nrow(brnch_seg)
    brnch_seg$seg_len <- brnch_seg$length
    for( is in 1:(n_seg-1) ) {
      brnch_seg$seg_len[is] <- brnch_seg$length[is] - brnch_seg$length[is+1]
      brnch_seg$low_vert[is] <- brnch_seg$high_vert[is+1]
    }
    tree  <- rbind(tree, brnch_seg)
  }
  tree <- tree[-1,]
  #  Create graph of tree and positions
  tree_graph <- graph.data.frame(tree[,c("low_vert","high_vert")])  
  E(tree_graph)$label <- tree$high_vert
  E(tree_graph)$brnch_type <- tree$brnch_type
  E(tree_graph)$weight <- tree$seg_len
  #  calculate shortest distances between position vertices
  position_verts <- V(tree_graph)[grep("_", V(tree_graph)$name)] 
  vert_dist <- shortest.paths(tree_graph, v=position_verts, to=position_verts, mode="all")  
  return(dist_mat= vert_dist )
}

我已经针对您的问题中发布的代码对igraph代码（t.graph函数）进行了基准测试，为Remi数据创建了代码allposition的函数你的distance功能。样本树被创建为树的扩展，allpositions数据用于64,256和2048个分支的树，allpositions等于这些大小的两倍。执行时间的比较如下所示。请注意，时间是毫秒。

 microbenchmark(matR16 <- Remi(tree, allpositions), matG16 <- t.graph(tree, allpositions),
                matR256 <- Remi(tree256, allpositions256), matG256 <- t.graph(tree256, allpositions256), times=2)
Unit: milliseconds
                                         expr          min           lq         mean       median           uq          max neval
           matR8 <- Remi(tree, allpositions)     58.82173     58.82173     59.92444     59.92444     61.02714     61.02714     2
        matG8 <- t.graph(tree, allpositions)     11.82064     11.82064     13.15275     13.15275     14.48486     14.48486     2
    matR256 <- Remi(tree256, allpositions256) 114795.50865 114795.50865 114838.99490 114838.99490 114882.48114 114882.48114     2
 matG256 <- t.graph(tree256, allpositions256)    379.54559    379.54559    379.76673    379.76673    379.98787    379.98787     2

与您发布的代码相比，igraph结果对于8分支的情况仅快了约5倍，但对于256个分支的速度快了300倍，因此igraph似乎随着大小更好地扩展。我还使用以下结果对2048分支案例的igraph代码进行了基准测试。再次是毫秒。

microbenchmark(matG8 <- t.graph(tree, allpositions), matG64 <- t.graph(tree64, allpositions64),
               matG256 <- t.graph(tree256, allpositions256),  matG2k <- t.graph(tree2k, allpositions2k), times=2)
Unit: milliseconds
                                         expr         min          lq        mean      median          uq         max neval
         matG8 <- t.graph(tree, allpositions)    11.78072    11.78072    12.00599    12.00599    12.23126    12.23126     2
    matG64 <- t.graph(tree64, allpositions64)    73.29006    73.29006    73.49409    73.49409    73.69812    73.69812     2
 matG256 <- t.graph(tree256, allpositions256)   377.21756   377.21756   410.01268   410.01268   442.80780   442.80780     2
    matG2k <- t.graph(tree2k, allpositions2k) 11311.05758 11311.05758 11362.93701 11362.93701 11414.81645 11414.81645     2

因此，在不到12秒的时间内计算出约4000个位置的距离矩阵。 t.graph返回距离矩阵，其中矩阵的行和列标有branch names - position on the branch，例如

      0_7 0_1 L_8 L_5 LL_8 LL_2 R_3 R_2 RL_2 RL_1 RLL_3 RLL_2 RLR_5 RR_6
L_5    18  24   3   0   15    9   8   7   26   25    39    38    41   30

显示从L-5到L分支的5个单位到其他位置的距离。我不知道这会处理你最大的案件，但对某些人来说可能会有所帮助。您的最大案例的存储要求也存在问题。

计算树上两个位置之间距离的性能？

1 个答案: