Clustering based on point connectivity

Time: 2016-10-03 11:30:30

Tags: r cluster-analysis

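I have 1 million records of lat long [5-digit precision] and route. I want to cluster those data points.

I don't want to use standard k-means clustering because I'm not sure how many clusters there are [I tried the Elbow method but I'm not convinced by it].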

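Here is my logic:

1) I want to reduce the precision of the lat long from 5 digits to 3 digits.

2) Now, lat longs that are within +/- 0.001 of each other should be grouped into one cluster. Then compute the centroid of each cluster.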

But I can't find a good algorithm or R script to implement this idea.

Can anyone help me with the above problem?

Thanks,

1 Answer:

Answer 0: (score: 1)

You can cluster based on connected components.

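All points that are within +/- 0.001 of each other can be connected, so we get a graph made of subgraphs, each of which is either a single point or a chain of connected points (a connected component). The connected components can then be found and their centroids computed. This task needs two packages:

1. deldir, to form a triangulation of the points, which specifies which points are adjacent to each other and lets us compute the distances between them.

2. igraph, to find the connected components.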
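As a small illustration of the idea, separate from the full script, here is a minimal sketch on six hand-picked points. The data and the names `pts`, `adj`, `g`, `comp` and `centroids` are made up for the example, and it uses brute-force pairwise distances via `dist()` instead of a triangulation, which is only reasonable for a handful of points.

```r
library(igraph)

# Six illustrative points, already rounded to 3 decimal places
pts <- data.frame(lat  = c(0.100, 0.101, 0.102, 0.500, 0.500, 0.900),
                  long = c(0.200, 0.200, 0.200, 0.300, 0.301, 0.900))

# All pairwise Manhattan distances (fine for 6 points, not for a million)
d <- as.matrix(dist(pts, method = "manhattan"))

# Link two points when they are about .001 apart; .0011 leaves a little
# tolerance for floating-point error. Multiplying by 1 gives a 0/1 matrix.
adj <- (d < 0.0011) * 1
diag(adj) <- 0

# Build the undirected graph and find its connected components
g <- graph_from_adjacency_matrix(adj, mode = "undirected")
comp <- components(g)

# One centroid per component
centroids <- aggregate(pts, by = list(cluster = comp$membership), FUN = mean)
print(centroids)
```

Points 1 and 3 are 0.002 apart, so they are not linked directly, but they end up in the same component because both are linked to point 2. That chaining is what distinguishes this from simply rounding and grouping identical coordinates.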

library(deldir)
library(igraph)

# Example data: 1,000,000 random points (substitute your own lat/long columns)
coords <- data.frame(lat = runif(1000000), long = runif(1000000))

# Round to 3 decimal places
coords.r <- round(coords, 3)

# Remove duplicates
coords.u <- unique(coords.r)

# Create the Delaunay triangulation of the points.
# Depending on the data, this may take a while and consume a lot of memory.
triangulation <- deldir(coords.u$long, coords.u$lat)

# Compute the Manhattan distance between adjacent points
distances <- abs(triangulation$delsgs$x1 - triangulation$delsgs$x2) +
             abs(triangulation$delsgs$y1 - triangulation$delsgs$y2)

# Keep only the edges that are no longer than .001
# (.0011 leaves a little tolerance for floating-point error).
# Columns 5:6 of delsgs hold the indices of the two points joined by each edge.
edge.list <- as.matrix(triangulation$delsgs[distances < .0011, 5:6])
if (length(edge.list) == 0) { # there is no edge shorter than .0011
    coords.clustered <- coords.u
} else { # find connected components

    # Relabel the point indices as consecutive integers, so that if the list is
    #   9 5
    #   5 7
    # it is reformatted to
    #   3 1
    #   1 2
    sorted <- sort(c(edge.list), index.return = TRUE)
    run.length <- rle(sorted$x)
    indices <- rep(seq_along(run.length$lengths), times = run.length$lengths)
    edge.list.reformatted <- edge.list
    edge.list.reformatted[sorted$ix] <- indices

    # Create the graph from the list of edges
    graph.struct <- graph_from_edgelist(edge.list.reformatted, directed = FALSE)

    # Cluster based on connected components
    clust <- components(graph.struct)

    # Compute the centroid of each component;
    # run.length$values holds the original indices of the connected points
    coords.connected <- coords.u[run.length$values, ]
    centroids <- data.frame(lat = tapply(coords.connected$lat, factor(clust$membership), mean),
                            long = tapply(coords.connected$long, factor(clust$membership), mean))

    # Combine the centroids with the points that have no neighbour
    coords.clustered <- rbind(coords.u[-run.length$values, ], centroids)

    # Round the result and remove possible duplicates
    coords.clustered <- round(coords.clustered, 3)
    coords.clustered <- unique(coords.clustered)
}
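
The relabelling step in the script can be checked in isolation. Below is a minimal sketch using the 9 5 / 5 7 example from the comments. The `match()` formulation is an alternative that I believe gives the same result as the sort/rle approach, and `ids` and `relabelled` are illustrative names rather than part of the script.

```r
# The example edge list from the comments: edges 9-5 and 5-7
edge.list <- matrix(c(9, 5,
                      5, 7), ncol = 2, byrow = TRUE)

# Distinct point indices that appear in at least one edge, in increasing order
ids <- sort(unique(c(edge.list)))

# Replace every index by its position in ids, keeping the two-column layout
relabelled <- matrix(match(edge.list, ids), ncol = 2)
print(relabelled)
```

Here `ids` plays the same role as `run.length$values` in the script: it maps the new vertex numbers back to rows of `coords.u`.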