Question

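I am working with a data set that gives the coordinates of a large number of points. Each row corresponds to one point, and the columns give that point's x-coordinate and y-coordinate. I am trying to re-order the rows so that, if point X is closest to point Y, then row X is closest to row Y.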

I have written the following code, but R takes a very long time to run it, so I was wondering whether you could help me write something faster:

# first I create a function that calculates the distance between the points associated with row a and row b:
distance = function(a,b) {
  u <- c(d[a,5], d[a,6]) ### d is the data frame whose rows I am re-ordering; its 5th column gives the x-coordinate, the 6th its y-coordinate
  v <- c(d[b,5], d[b,6])
  dist <- sqrt((u[1]-v[1])^2 + (u[2] - v[2])^2)
  return(dist)
}

h <- rep(0, nrow(d)) 
l <- rep(0, nrow(d)) ### I will put in this variable the correct order of the rows numbers
l[1] <- 1 ### I start with row 1

for (i in 1:nrow(d)) {
  if (i == 1) {  ### I calculate the distance between the first point and all the other points
    for (j in 1:length(h)) {
      h[j] <- distance(1, j)
    }
  } else {
    for (j in 1:length(h)) {  ### I calculate the distance between the point considered (l[i-1]) and all the other points
      h[j] <- distance(l[i-1], j)
    }
  }
  k <- h[!is.nan(h)]  ### for some reasons I get NaN (not sure why) and that makes the min() function below output nothing interesting
  l[i+1] <- which(h == min(k[-c(i, l[1:i])]))  ### I get the row number who is closest to the point considered (e.g. the smallest value in h) that is not the row itself or rows already considered
}

Thanks!

Answer 1

This is a completely different solution, but since you are computing distances and trying to "cluster" your rows, it may be useful.

set.seed(200)
data <- data.frame(x = sample(1:100, 5),
                   y = sample(1:200, 5))

hc <- hclust(dist(data))
data2 <- data[hc$order,]

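dist computes the Euclidean distance between all pairs of rows in your data. hclust then uses those distances to cluster the points. I am just using the default methods here, but you can change them. Finally, you can use the order given by hclust to reorganize your data. This should be quite fast.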
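As a rough sketch of how this could be applied to the data frame from the question, the example below restricts dist to the two coordinate columns and swaps the default linkage for another one. The data frame `d` here is a made-up stand-in (only the layout, with x in column 5 and y in column 6, comes from the question), and the choice of method = "single" is only an illustration, not a recommendation from the answer.

```r
# Hypothetical stand-in for the asker's data frame `d`:
# columns 5 and 6 hold the x- and y-coordinates, the rest is filler.
set.seed(200)
d <- data.frame(id = 1:8, a = NA, b = NA, c = NA,
                x = runif(8, 0, 100),
                y = runif(8, 0, 200))

# Pairwise Euclidean distances between the points (rows),
# computed from the coordinate columns only.
dm <- dist(d[, 5:6], method = "euclidean")

# Single linkage merges clusters by their closest pair of points;
# hclust's default is method = "complete".
hc <- hclust(dm, method = "single")

# hc$order is a permutation of the row indices; use it to reorder the rows.
d2 <- d[hc$order, ]
```

Note that dist stores one distance per pair of rows, so its memory use grows quadratically with the number of points. Also, the resulting order follows the dendrogram, so rows that end up adjacent tend to be close in the plane, but it is not guaranteed that each row sits next to its single nearest neighbour.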

Making the code run faster (re-ordering rows that correspond to points in a plane)
