Question

问题

需要确定一种方法来查找每个点都有值的2英里点群集。识别2英里区域，其总和（值）> 50.

数据

我的数据类似于以下

ID        COUNT LATITUDE    LONGITUDE
187601546   20  025.56394   -080.03206
187601547   25  025.56394   -080.03206
187601548   4   025.56394   -080.03206
187601550   0   025.56298   -080.03285

大约200K的记录。我需要确定的是，在一英里半径（2英里直径）区域内是否存在超过65的计数总和的区域。

将每个点用作区域的中心

现在，我有来自另一个项目的python代码，它将在x直径点周围绘制一个shapefile，如下所示：

def poly_based_on_distance(center_lat,center_long, distance, bearing):
# bearing is in degrees
# distance in miles
# print ('center', center_lat, center_long)

    destination = (vincenty(miles=distance).destination(Point(center_lat, 
       center_long), bearing).format_decimal())

返回目的地的例行程序，然后查看半径内的哪些点。

## This is the evaluation for overlap between points and 
    ## area polyshapes
    area_list = []
    store_geo_dict = {}
    for stores in locationdict:
        location = Polygon(locationdict[stores])

        for areas in AREAdictionary:
            area = Polygon(AREAdictionary[areass])
            if store.intersects(area):
                area_list.append(areas)

        store_geo_dict[stores] = area_list
        area_list = []

此时，我只是在每个200K点周围绘制一个圆形形状文件，查看其他内部并进行计数。

需要聚类算法吗？

但是，可能存在具有所需计数密度的区域，其中一个点不在中心。

我熟悉使用属性进行分类的聚类算法，例如DBSCAN，但这是使用每个点的值找到密度聚类的问题。是否有任何聚类算法可以找到内径为> = 50的2英里直径圆的任何聚类？

任何建议，python或R都是首选工具，但这是开放性的，可能是一次性的，因此计算效率不是优先考虑的事项。

Answer 1

不是一个完整的解决方案，但它可能有助于简化问题，具体取决于数据的分布。我将在我的示例中使用平面坐标和cKDTree，如果您可以忽略投影中的曲率，这可能适用于地理数据。

主要观察如下：如果(x,y)周围的球2*r（例如2英里），则(x,y)点不会对密集群组做出贡献贡献小于截止值（例如标题中的50）。事实上，r (x,y)内的任何一点都不会影响蚂蚁密集群。

这允许您重复丢弃点。如果你没有积分，就没有密集的星团;如果留下一些点，可能存在聚类。

import numpy as np
from scipy.spatial import cKDTree

# test data
N = 1000
data = np.random.rand(N, 2)
x, y = data.T

# test weights of each point
weights = np.random.rand(N)


def filter_noncontrib(pts, weights, radius=0.1, cutoff=60):
    tree = cKDTree(pts)
    contribs = np.array(
        [weights[tree.query_ball_point(pt, 2 * radius)].sum() for pt in pts]
    )
    return contribs >= cutoff


def possible_contributors(pts, weights, radius=0.1, cutoff=60):
    n_pts = len(pts)
    while len(pts):
        mask = filter_noncontrib(pts, weights, radius, cutoff)
        pts = pts[mask]
        weights = weights[mask]

        if len(pts) == n_pts:
            break

        n_pts = len(pts)

    return pts

虚拟数据示例：

Answer 2

可以调整DBSCAN（参见广义DBSCAN;将核心点定义为权重和> = 50），但它不能确保最大簇大小（它计算传递闭包）。

您也可以尝试完整的链接。用它来查找具有所需最大直径的簇，然后检查它们是否满足所需的密度。但这并不能保证找到所有。

（a）为快速半径搜索构建索引可能更快。（b）对于每一点，找到半径为r的邻居;如果他们有所需的最低金额，请保留。但这并不能保证找到所有内容，因为中心不一定是数据点。考虑最大半径为1，最小权重为100.两个点的权重为50，每个点为（0,0）和（1,1）。（0,0）处的查询和（1,1）处的查询都不会发现解决方案，但（.5，.5）处的聚类不满足条件。

不幸的是，我相信你的问题至少是NP难的，所以你无法承受最终的解决方案。

识别具有最大距离直径和总和（重量）的加权簇> 50

问题

数据

将每个点用作区域的中心

需要聚类算法吗？

2 个答案: