Python / SciPy: KDTree query_ball_point performance issue

Time: 2017-03-31 08:35:53

Tags: python scipy geospatial kdtree spatial-index

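This question relates to the Kaggle Two Sigma Rental Listings Challenge. Its training data set contains roughly 49,000 rows. As part of feature engineering, I am trying to compute the following two features: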

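  1. The minimum distance to any other listing, as a measure of how dense the listings are in that area. Hypothesis: the denser the area, the higher the interest.
  2. The number of listings within a 500-meter radius. Hypotheses: a) the more listings there are close to my listing, the higher the interest; b) if those listings have different addresses, the listing is more likely to be located at a major intersection.

To do this, I used SciPy's KDTree (scipy.spatial.KDTree). Scroll down for the question; read on for the details.

Because of that, I had to convert longitude and latitude to Cartesian coordinates.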

    from math import cos, radians, sin
    
    import pandas as pd
    
    df = pd.read_json('data/train.json')
    
    def to_Cartesian(lat, lng):
        # Earth radius in km, so x, y, z and all tree distances are in km
        R = 6367
    
        lat_, lng_ = map(radians, [lat, lng])
    
        x = R * cos(lat_) * cos(lng_)
        y = R * cos(lat_) * sin(lng_)
        z = R * sin(lat_)
        return x, y, z
    
    df['x'], df['y'], df['z'] = zip(*map(to_Cartesian, df['latitude'], df['longitude']))
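
The same conversion can also be written with NumPy so that it runs on whole columns at once. This is only a sketch: the function name `to_cartesian_vectorized` and the sample coordinates are made up for illustration, and the radius is the same 6367 km used above.

```python
import numpy as np

EARTH_RADIUS_KM = 6367.0  # same radius as in to_Cartesian above

def to_cartesian_vectorized(lat_deg, lng_deg, radius_km=EARTH_RADIUS_KM):
    """Convert latitude/longitude arrays (degrees) to an (n, 3) array of x, y, z in km."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lng = np.radians(np.asarray(lng_deg, dtype=float))
    x = radius_km * np.cos(lat) * np.cos(lng)
    y = radius_km * np.cos(lat) * np.sin(lng)
    z = radius_km * np.sin(lat)
    return np.column_stack((x, y, z))

# Hypothetical sample points, not taken from the data set
sample_xyz = to_cartesian_vectorized([40.7145, 40.7947], [-73.9425, -73.9667])
```

With the data frame from above, this could presumably be called as `to_cartesian_vectorized(df['latitude'], df['longitude'])`, and the resulting array passed to the tree constructor directly.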
    

I then built a KDTree from the x, y, z Cartesian coordinates, which lets me query it with metric distances (kilometers and meters).

    coordinates = list(zip(df['x'], df['y'], df['z']))
    
    from scipy import spatial
    tree = spatial.KDTree(coordinates)
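
One caveat about these metric distances: the tree measures straight-line (chord) distances between the 3-D points, not distances along the Earth's surface. For a 500-meter radius the difference is negligible, but it grows with distance. The sketch below converts between the two, assuming a perfect sphere with the same 6367 km radius; the helper names `arc_to_chord_km` and `chord_to_arc_km` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6367.0  # same radius as in to_Cartesian above

def arc_to_chord_km(arc_km, radius_km=EARTH_RADIUS_KM):
    """Surface (great-circle) distance -> straight-line chord length."""
    return 2.0 * radius_km * math.sin(arc_km / (2.0 * radius_km))

def chord_to_arc_km(chord_km, radius_km=EARTH_RADIUS_KM):
    """Straight-line chord length -> surface (great-circle) distance."""
    return 2.0 * radius_km * math.asin(chord_km / (2.0 * radius_km))

# Chord length to pass to the tree for a 500 m search radius on the surface
search_radius_km = arc_to_chord_km(0.5)
```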
    

With the KDTree as an index, I can now query it. To compute the first feature above, I used the KDTree.query method.

    import sys
    
    def get_min_distance(row, data, tree):
        # Cartesian coordinates of the listing, needed to query the k-d tree
        coords = row['x'], row['y'], row['z']
        # query the 3 listings that are closest to the current listing
        closest = tree.query(coords, 3)
        # first array holds the Euclidean distances, second array the indices
        distances, indices = closest[0], closest[1]
    
        # how many results did we get?
        length = len(distances)
        mdist = sys.maxsize
    
        # start at 1 to skip the listing itself at index 0
        for i in range(1, length):
            distance = distances[i]
            if distance < mdist:
                mdist = distance
    
        return mdist
    
    df['min_distance_km'] = df.apply(lambda row: get_min_distance(row, df, tree), axis=1)
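
The nearest-neighbor lookup can probably be done in a single call instead of once per row, because `query` also accepts an array of points. The sketch below assumes the compiled `scipy.spatial.cKDTree` class is available in the installed SciPy, and uses random points as a hypothetical stand-in for the x, y, z columns.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical stand-in for the (x, y, z) coordinates in km
rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(1000, 3))

tree = cKDTree(points)

# k=2: column 0 is the query point itself (distance 0),
# column 1 is the nearest other point
distances, indices = tree.query(points, k=2)
min_distance_km = distances[:, 1]
```

If several listings share exactly the same coordinates, column 1 is 0 for them, which matches the per-row version above.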
    

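I then tried the KDTree.query_ball_point method for the second feature, i.e. finding the listings within a 500-meter radius of a single listing. Problem: this exhausts my 8 GB of RAM and never finishes. Since a KDTree is a spatial index, I expected it to complete almost instantly. So what am I doing wrong?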

    #def get_neighbors(x, y, z):
    def get_neighbors(row):
        # Cartesian coordinates of the listing, needed to query the k-d tree
        #coords = [x, y, z]
        coords = row['x'], row['y'], row['z']
        # indices of the listings within 500 meters (0.5 km) of this listing
        indices = tree.query_ball_point(coords, 0.5)
    
        # how many results did we get? (minus 1 because the listing itself is included as well)
        length = len(indices)
        addresses = []
    
        # start at 1, assuming the listing itself is the first result
        for i in range(1, length):
            idx = indices[i]
            addresses.append(idx)
            #address = df.get_value(df.index[idx], 'display_address')
            #addresses.append(str(address))
    
        return addresses
        #return length-1, addresses, len(set(addresses))
        #return length-1, len(set(addresses))
    
    df['neighborhood'] = df.apply(lambda row: get_neighbors(row), axis=1)
    #df['neighborhood'], df['addresses'], df['cnt_addresses'] = zip(*map(get_neighbors, df['x'], df['y'], df['z']))
    
    #df[df['neighborhood'] > 0].head(n=2)
    df.sort_values(by='neighborhood', ascending=False).head(n=5)
    #df.head(n=10)
    #df[df['neighborhood'] > df['addresses']].head(n=2)
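
For comparison, here is a sketch of how both the neighbor count and the number of distinct addresses could be derived from one array-valued `query_ball_point` call. It assumes `scipy.spatial.cKDTree` is available; the random points and address labels are hypothetical stand-ins for the real columns. The query point is dropped by comparing indices rather than by position, since the returned index lists are not guaranteed to start with the query point.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical stand-ins for the coordinates (km) and the display addresses
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 5.0, size=(500, 3))
addresses = rng.integers(0, 50, size=len(points)).astype(str)

tree = cKDTree(points)
radius_km = 0.5

# One list of neighbor indices per query point (each includes the point itself)
neighbor_lists = tree.query_ball_point(points, r=radius_km)

neighbor_count = np.empty(len(points), dtype=int)
distinct_addresses = np.empty(len(points), dtype=int)
for i, neighbors in enumerate(neighbor_lists):
    others = [j for j in neighbors if j != i]  # drop the query point by index
    neighbor_count[i] = len(others)
    distinct_addresses[i] = len({addresses[j] for j in others})
```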
    

Update: I also tried a batch approach like this:

    results = tree.query_ball_point(coordinates, 25)
    len(results)
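
Since the coordinates are in kilometers, a radius of 25 here means 25 km rather than the 500 m (0.5) used earlier, so every result list can become very large. If only the counts are needed, one way to keep memory bounded might be to query in chunks and store just the list lengths. The sketch below assumes `scipy.spatial.cKDTree` is available; the function name `count_neighbors_in_chunks`, the chunk size, and the demo points are all hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def count_neighbors_in_chunks(points, radius, chunk_size=1000):
    """Count the other points within `radius` of each point, one chunk at a time."""
    points = np.asarray(points, dtype=float)
    tree = cKDTree(points)
    counts = np.empty(len(points), dtype=int)
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        neighbor_lists = tree.query_ball_point(chunk, r=radius)
        # each list includes the query point itself, hence the -1
        counts[start:start + len(chunk)] = [len(n) - 1 for n in neighbor_lists]
    return counts

# Hypothetical demo data in km
rng = np.random.default_rng(1)
demo_points = rng.uniform(0.0, 5.0, size=(300, 3))
demo_counts = count_neighbors_in_chunks(demo_points, radius=0.5, chunk_size=64)
```

Only one chunk of index lists is alive at a time, so peak memory depends on the chunk size and the radius rather than on the full data set.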
    

0 answers:

No answers