Question

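I am trying to bucket coordinates into the nearest coordinate. In a sense, I am doing a single iteration of kmeans clustering with 1222 centroids. Below is a function that does this, but it is imperfect and too slow. I am looking for help improving it: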

discretizeCourt <- function(x_loc, y_loc) {

  # create the dataframe of points that I want to round coordinates to
  y <- seq(0, 50, by = 2)
  x1 <- seq(1, 93, by = 2)
  x2 <- seq(2, 94, by = 2)
  x <- c(x1, x2)

  coordinates <- data.frame(
    x = rep(x, 13),
    y = rep(y, each = length(x1)),
    count = 0
  )

  # loop over each point in x_loc and y_loc
  # increment the count column whenever a point is 'near' that column      
  for(i in 1:length(x_loc)) {
    this_x = x_loc[i]
    this_y = y_loc[i]

    coordinates[coordinates$x > this_x-1 & 
                coordinates$x < this_x+1 & 
                coordinates$y > this_y-1 & 
                coordinates$y < this_y+1, ]$count =
      coordinates[coordinates$x > this_x-1 & 
                    coordinates$x < this_x+1 & 
                    coordinates$y > this_y-1 & 
                    coordinates$y < this_y+1, ]$count + 1
  }

  # return the grid together with the accumulated counts
  coordinates
}

Here is some of the test data I am working with:

> dput(head(x_loc, n = 50))
c(13.57165, 13.61702, 13.66478, 13.70833, 13.75272, 13.7946, 
13.83851, 13.86792, 13.8973, 13.93906, 13.98099, 14.02396, 14.06338, 
14.10872, 14.15412, 14.2015, 14.26116, 14.30871, 14.35056, 14.39536, 
14.43964, 14.48442, 14.5324, 14.57675, 14.62267, 14.66972, 14.71443, 
14.75383, 14.79012, 14.82455, 14.85587, 14.87557, 14.90737, 14.9446, 
14.97763, 15.01079, 15.04086, 15.06752, 15.09516, 15.12394, 15.15191, 
15.18061, 15.20413, 15.22896, 15.25411, 15.28108, 15.3077, 15.33578, 
15.36507, 15.39272)

> dput(head(y_loc, n = 50))
c(25.18298, 25.17431, 25.17784, 25.18865, 25.20188, 25.22865, 
25.26254, 25.22778, 25.20162, 25.25191, 25.3044, 25.35787, 25.40347, 
25.46049, 25.5199, 25.57132, 25.6773, 25.69842, 25.73877, 25.78383, 
25.82168, 25.86067, 25.89984, 25.93067, 25.96943, 26.01083, 26.05861, 
26.11965, 26.18428, 26.25347, 26.3352, 26.35756, 26.4682, 26.55412, 
26.63745, 26.72157, 26.80021, 26.8691, 26.93522, 26.98879, 27.03783, 
27.07818, 27.03786, 26.9909, 26.93697, 26.87916, 26.81606, 26.74908, 
26.67815, 26.60898)

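My actual x_loc and y_loc files hold ~60000 coordinates each, and I have thousands of such files, so this is a lot of work. I am fairly sure the function is slow because of the way I am indexing and incrementing.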

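The counting is not perfect either. A technically better approach would be to loop over all 60000 points (only the 50 points above in this example) and, for each point, compute its distance to every point in the coordinates data frame (1222 points). However, that is 60000 * 1222 calculations for this one set of points alone, which is too many.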

Any help is greatly appreciated! Thanks,

Edit: I am working on converting my data frames/vectors into 2 matrices and vectorizing the whole approach. I will let you know whether it works.

Answer 1

If you want something faster than the matrix solution, consider using the data.table library. See the following example:

library(data.table)

df <- data.table(x_loc, y_loc) # Your two vectors are combined into a data.table
df$row.idx <- 1:nrow(df) # This column is used as ID for each sample point.

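Now we can find the correct coordinate for each point. Later we can count how many points belong to a given coordinate. We start by keeping the coordinates data frame: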

y <- seq(0, 50, by = 2)
x1 <- seq(1, 93, by = 2)
x2 <- seq(2, 94, by = 2)
x <- c(x1, x2)

coordinates <- data.frame(
     x = rep(x, 13),
     y = rep(y, each = length(x1)),
     count = 0
)
coordinates$row <- 1:nrow(coordinates) # Similar to yours. However, this time we are interested in seeing which points belong to this coordinate.

Now we define a function that checks the coordinates and returns the one that lies within one unit of the point in question.

f <- function(this_x, this_y, coordinates) {
     res <- coordinates[coordinates$x > this_x-1 & 
                             coordinates$x < this_x+1 & 
                             coordinates$y > this_y-1 & 
                             coordinates$y < this_y+1, ]$row
     res
}
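
As a quick check of what f returns, here is a minimal sketch that rebuilds the grid from above and calls f on the first test point from the question and on a point with an odd integer y value. The names hit and miss are only for illustration. Because the comparisons are strict, a query can come back with no row at all, which is worth knowing before using the result in an assignment.

```r
# Rebuild the grid exactly as in the answer above.
y  <- seq(0, 50, by = 2)
x1 <- seq(1, 93, by = 2)
x2 <- seq(2, 94, by = 2)
x  <- c(x1, x2)
coordinates <- data.frame(
  x = rep(x, 13),
  y = rep(y, each = length(x1)),
  count = 0
)
coordinates$row <- seq_len(nrow(coordinates))

# Same box test as f above: strictly within one unit in x and in y.
f <- function(this_x, this_y, coordinates) {
  coordinates[coordinates$x > this_x - 1 &
              coordinates$x < this_x + 1 &
              coordinates$y > this_y - 1 &
              coordinates$y < this_y + 1, ]$row
}

# First test point from the question: exactly one grid point falls in the box.
hit <- f(13.57165, 25.18298, coordinates)
coordinates[hit, c("x", "y")]   # the grid point at x = 14, y = 26

# A point with an odd integer y: no even y lies strictly inside (24, 26).
miss <- f(13, 25, coordinates)
length(miss)                    # 0
```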

For each point, we find its correct coordinate:

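# f() needs the grid as its third argument; [1L] yields NA when no grid point matches
df[, coordinate.idx := f(x_loc, y_loc, coordinates)[1L], by = row.idx]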
df[, row.idx := NULL]
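
Since the grid is regular, the lookup can also be done with plain vectorized arithmetic instead of one call to f per row. The sketch below is an alternative technique, not the method used in this answer: it treats the grid as two interleaved rectangular lattices (odd x with y in 0, 4, ..., 48 and even x with y in 2, 6, ..., 50), rounds each point onto both, and keeps the closer candidate, so it returns the true nearest grid point rather than a point inside a box. nearest_grid_row and n_per_level are hypothetical names, and the row formula assumes coordinates is built exactly as shown earlier.

```r
# Rebuild the staggered grid (1222 points) in the same row order as above.
y  <- seq(0, 50, by = 2)
x1 <- seq(1, 93, by = 2)
x2 <- seq(2, 94, by = 2)
coordinates <- data.frame(
  x = rep(c(x1, x2), 13),
  y = rep(y, each = length(x1))
)

# Hypothetical helper: row index in `coordinates` of the nearest grid point.
nearest_grid_row <- function(px, py) {
  n_per_level <- 47  # length(x1): grid points per y level (assumed layout)
  clamp <- function(v, lo, hi) pmin(pmax(v, lo), hi)

  # Candidate on the lattice with odd x and y in 0, 4, ..., 48.
  xa <- clamp(2 * round((px - 1) / 2) + 1, 1, 93)
  ya <- clamp(4 * round(py / 4), 0, 48)

  # Candidate on the lattice with even x and y in 2, 6, ..., 50.
  xb <- clamp(2 * round(px / 2), 2, 94)
  yb <- clamp(4 * round((py - 2) / 4) + 2, 2, 50)

  # Keep whichever candidate is closer (squared Euclidean distance).
  use_a <- (px - xa)^2 + (py - ya)^2 <= (px - xb)^2 + (py - yb)^2
  gx <- ifelse(use_a, xa, xb)
  gy <- ifelse(use_a, ya, yb)

  # Map (gx, gy) back to the row order of `coordinates`.
  as.integer((gy / 2) * n_per_level + (gx - 1) %/% 2 + 1)
}

# First and last test points from the question.
idx <- nearest_grid_row(c(13.57165, 15.39272), c(25.18298, 26.60898))
coordinates[idx, ]
```

With the real data this could be used as df[, coordinate.idx := nearest_grid_row(x_loc, y_loc)], without the by = row.idx grouping.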

df now contains the following variables: (x_loc, y_loc, coordinate.idx). You can use this to fill in coordinates$count. Even with 60000 points, it should not take more than 1 second.

for(i in 1:nrow(coordinates)) {
    # count the points assigned to grid row i and store it in that row only
    coordinates$count[i] = length(which(df$coordinate.idx == i))
}
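
The counting loop can also be written as a single call to base R's tabulate(), which counts how often each integer from 1 to nbins occurs and skips NA values. A minimal sketch with made-up indices (coordinate_idx and n_grid are illustrative stand-ins for df$coordinate.idx and nrow(coordinates)):

```r
# Made-up grid-row indices for six points; NA marks a point with no match.
coordinate_idx <- c(618L, 618L, 619L, NA, 1L, 619L)
n_grid <- 1222L  # stand-in for nrow(coordinates)

# One pass over the indices instead of one scan per grid row.
count <- tabulate(coordinate_idx, nbins = n_grid)

# Counts for grid rows 1, 618 and 619.
count[c(1, 618, 619)]
```

With the real data this would be coordinates$count <- tabulate(df$coordinate.idx, nbins = nrow(coordinates)).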

In R, discretize floating-point coordinates to the nearest coordinates
