Question

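This question relates to the Kaggle Two Sigma Rental Listings Challenge. It comes with a training data set of roughly 49,000 rows. As part of feature engineering, I am trying to compute the following two features: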

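The minimum distance to any other listing, as a measure of how dense the listings are in that area. Assumption: the denser the area, the higher the interest.
The number of listings within a 500-meter radius. Assumptions: a) the more listings there are close to my listing, the higher the interest; b) if those listings have different addresses, the listing is more likely to be at a large intersection.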

To do this, I used SciPy's KDTree. Scroll down for the question itself; read on for the details.

scipy.spatial.KDTree

For this, I had to convert longitude and latitude to Cartesian coordinates.

import pandas as pd

df = pd.read_json('data/train.json')

from math import cos, radians, sin

def to_Cartesian(lat, lng):
    R = 6367

    lat_, lng_ = map(radians, [lat, lng])

    x = R * cos(lat_) * cos(lng_)
    y = R * cos(lat_) * sin(lng_)
    z = R * sin(lat_)
    return x, y, z

df['x'], df['y'], df['z'] = zip(*map(to_Cartesian, df['latitude'], df['longitude']))
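The same conversion can also be written with NumPy so that it works on whole columns at once instead of row by row. This is a minimal sketch, not part of the original code: the function name `to_cartesian_vectorized` and the sample coordinates are hypothetical, and the radius of 6367 km is simply taken from `to_Cartesian` above.

```python
import numpy as np

EARTH_RADIUS_KM = 6367  # same radius as used in to_Cartesian above


def to_cartesian_vectorized(lat_deg, lng_deg, radius_km=EARTH_RADIUS_KM):
    """Convert latitude/longitude in degrees to x, y, z in kilometers."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lng = np.radians(np.asarray(lng_deg, dtype=float))
    x = radius_km * np.cos(lat) * np.cos(lng)
    y = radius_km * np.cos(lat) * np.sin(lng)
    z = radius_km * np.sin(lat)
    return x, y, z


# Hypothetical sample points (roughly New York City), for illustration only
sample_lat = np.array([40.7145, 40.7947, 40.7388])
sample_lng = np.array([-73.9425, -73.9667, -74.0018])
x, y, z = to_cartesian_vectorized(sample_lat, sample_lng)
```

With the DataFrame from the question, the call would presumably look like `to_cartesian_vectorized(df['latitude'], df['longitude'])`.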

Then I built a KDTree from the X, Y, Z Cartesian coordinates, which lets me work with metric distances (kilometers and meters) in the KDTree.

coordinates = list(zip(df['x'], df['y'], df['z']))

from scipy import spatial
tree = spatial.KDTree(coordinates)
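One variation worth trying is `scipy.spatial.cKDTree`, built from a NumPy array rather than a list of tuples. In older SciPy releases `KDTree` was a pure-Python implementation and `cKDTree` a compiled one; whether that makes a difference depends on the installed version, so treat this as a sketch under that assumption. The random `points` array is a hypothetical stand-in for `df[['x', 'y', 'z']].to_numpy()`.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
# Hypothetical stand-in for the real x, y, z columns: 1,000 points in a 10 km cube
points = rng.uniform(0.0, 10.0, size=(1000, 3))

tree_c = cKDTree(points)

# Sanity check: the nearest neighbor of a point that is in the tree is the point itself
dist, idx = tree_c.query(points[0], k=1)
```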

Now that I have the KDTree as an index, I can query it. To compute the first feature above, I used the KDTree.query method.

import sys

def get_min_distance(row, data, tree):
    # get the Cartesian coordinates of the listing, which are needed to query the k-d tree
    coords = row['x'], row['y'], row['z']
    # query the 3 listings that are closest to the current listing
    closest = tree.query(coords, 3)
    # first array contains the Euclidean distances, second array contains the indices
    distances, indices = closest[0], closest[1]

    # how many results did we get?
    length = len(distances)
    mdist = sys.maxsize

    # start at 1 to skip the original coordinates at index 0
    for i in range(1, length):
        idx = indices[i]

        distance = distances[i]
        if distance < mdist:
            mdist = distance

    return mdist

df['min_distance_km'] = df.apply(lambda row: get_min_distance(row, df, tree), axis=1)
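The nearest-neighbor distance can also be obtained for all listings in a single call, because `query` accepts an array of points and returns the distances sorted in ascending order. The sketch below assumes a `cKDTree` over a hypothetical random array; the names `tree_nn` and `min_distance_km` are illustrative only.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
points = rng.uniform(0.0, 10.0, size=(500, 3))  # hypothetical stand-in for x, y, z in km
tree_nn = cKDTree(points)

# k=2: column 0 is the query point itself (distance 0),
# column 1 is the closest other point in the tree
distances, indices = tree_nn.query(points, k=2)
min_distance_km = distances[:, 1]
```

If two listings share exactly the same coordinates, column 1 is 0 for both, which is still the correct minimum distance.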

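Then I tried to apply the KDTree.query_ball_point method to compute the second feature above, i.e. to find the listings within a 500-meter radius around a single listing. The problem: this exhausts my 8 GB of memory and never finishes. Since the KDTree is a spatial index, it should finish almost immediately. So what am I doing wrong?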

#def get_neighbors(x, y, z):
def get_neighbors(row):
    # get the Cartesian coordinates to query the k-d tree
    #coords = [x, y, z]
    coords = row['x'], row['y'], row['z']
    # query the indices of those listings that are within a range of 500 meters
    indices = tree.query_ball_point(coords, 0.5)

    # how many results did we get? (minus 1 because the listing itself is included as well)
    length = len(indices)
    addresses = []

    for i in range(1, length):
        idx = indices[i]
        addresses.append(idx)
        #address = df.get_value(df.index[idx], 'display_address')
        #addresses.append(str(address))

    return addresses
    #return length-1, addresses, len(set(addresses))
    #return length-1, len(set(addresses))

df['neighborhood'] = df.apply(lambda row: get_neighbors(row), axis=1)
#df['neighborhood'], df['addresses'], df['cnt_addresses'] = zip(*map(get_neighbors, df['x'], df['y'], df['z']))

#df[df['neighborhood'] > 0].head(n=2)
df.sort_values(by='neighborhood', ascending=False).head(n=5)
#df.head(n=10)
#df[df['neighborhood'] > df['addresses']].head(n=2)
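A batched variant of the radius query is sketched here, under the assumption that only the neighbor indices and their count are needed. `query_ball_point` accepts an array of points and returns one list of indices per point. Those lists are not ordered by distance, so the sketch removes each point's own index by value rather than by skipping position 0. The data and the names `tree_r`, `neighbors` and `neighbor_count` are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
points = rng.uniform(0.0, 5.0, size=(300, 3))  # hypothetical stand-in for x, y, z in km
tree_r = cKDTree(points)

RADIUS_KM = 0.5  # 500 meters, in the same unit as the coordinates
neighbor_lists = tree_r.query_ball_point(points, r=RADIUS_KM)

# Drop each point's own index by value; the result lists are not sorted by distance
neighbors = [[j for j in found if j != i] for i, found in enumerate(neighbor_lists)]
neighbor_count = np.array([len(found) for found in neighbors])
```

The indices are positions in the array the tree was built from, so with a DataFrame they would have to be mapped back with positional indexing (for example `df.iloc`) rather than by index label.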

Update: I also tried a batch approach like this:

results = tree.query_ball_point(coordinates, 25)
len(results)
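Note that the radius is given in the same unit as the coordinates, so with coordinates in kilometers a value of 25 means 25 km, not 25 m. If the listings are concentrated in a single city, most of them may lie within that radius of each other, and with roughly 49,000 rows the result could approach 49,000 × 49,000 ≈ 2.4 billion indices. This is only one possible explanation for the memory use. A cheap way to check it, sketched below on hypothetical random data, is `count_neighbors`, which counts the matching pairs without building the index lists.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
# Hypothetical stand-in: 2,000 points in a 10 km cube (coordinates in km)
points = rng.uniform(0.0, 10.0, size=(2000, 3))
tree_e = cKDTree(points)

pairs_by_radius = {}
for radius_km in (0.5, 25.0):
    # Counts ordered pairs (i, j) with distance <= radius_km, including i == j
    pairs_by_radius[radius_km] = tree_e.count_neighbors(tree_e, radius_km)

for radius_km, total_pairs in pairs_by_radius.items():
    print(f"radius {radius_km} km: {total_pairs} index entries in total")
```

In this toy data set every point is within 25 km of every other point, so the larger radius yields one entry per ordered pair of points, while the 0.5 km radius yields far fewer.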

Python / SciPy: KDTree query_ball_point performance issue
