Filling missing data at specific latitudes/longitudes from the nearest known neighbours

Time: 2017-09-21 13:06:22

Tags: python pandas numpy scipy

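I have a dataset of around 2 million rows, comprising various properties at specific latitudes and longitudes. For each property I have a valuation and a floor area. The valuations are complete, but not all properties have a floor area.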

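I would like to interpolate using some nearest-neighbour method to approximate the missing NaN values in the table. My software is written in Python, so the solution would probably need to use NumPy, pandas, SciPy or some combination of them.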

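I have looked at SciPy's cKDTree, and at approximating distances with the Haversine formula, but all the examples I have seen are about interpolating over a plane rather than filling gaps in missing data, and I am at a loss as to how to implement this.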
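For reference, the Haversine formula mentioned above can be sketched with NumPy roughly as follows. This is only a minimal illustration, not code from the original post: the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for the example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees.

    Accepts scalars or NumPy arrays (broadcast against each other).
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
```

Because the function is vectorised, one property can be compared against whole columns at once, e.g. `haversine_km(57.101893, -2.239883, df["lat"], df["long"])`, although doing this for every missing row would still be quadratic in the number of rows.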

As an example, here are the first few rows I am using as test data (ratio is just value/area):

lat       | long      | value | area  | ratio
----------|-----------|-------|-------|----------
57.101474 | -2.242851 | 12850 | 252.0 | 50.992063
57.102554 | -2.246308 | 14700 | 309.0 | 47.572816
57.100556 | -2.248342 | 25600 | 507.0 | 50.493097
57.101765 | -2.254688 | 28000 | 491.0 | 57.026477
57.097553 | -2.245483 | 5650  | 119.0 | 47.478992
57.098244 | -2.245768 | 43000 | 811.0 | 53.020962
57.098554 | -2.252504 | 46300 | 850.0 | 54.470588
57.102794 | -2.243454 | 7850  | 180.0 | 43.611111
57.101474 | -2.242851 | 26250 | NaN   | NaN
57.101893 | -2.239883 | 31000 | NaN   | NaN
57.101383 | -2.238955 | 28750 | NaN   | NaN
57.104578 | -2.235641 | 18500 | 327.0 | 56.574924
57.105424 | -2.234953 | 21950 | 406.0 | 54.064039
57.105516 | -2.233683 | 19600 | 408.0 | 48.039216

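The properties themselves could be grouped further to get better relationships (this is not part of the test data, but each property can be used for a different purpose, e.g. office, factory, shop).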

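I realise I could loop through slowly, gathering groups of properties by distance (testing each NaN property against all the others), but that seems like it would be heartbreakingly glacial.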

df.to_clipboard() output (first 15 rows):

    lat         long        value   area    ratio
0   57.101474   -2.242851   12850   252.0   50.992063
1   57.102554   -2.246308   14700   309.0   47.572816
2   57.100556   -2.248342   25600   507.0   50.493097
3   57.101765   -2.254688   28000   491.0   57.026477
4   57.097553   -2.245483   5650    119.0   47.478992
5   57.098244   -2.245768   43000   811.0   53.020962
6   57.098554   -2.252504   46300   850.0   54.470588
7   57.102794   -2.243454   7850    180.0   43.611111
8   57.101474   -2.242851   26250   NaN     NaN
9   57.101893   -2.239883   31000   NaN     NaN
10  57.101383   -2.238955   28750   NaN     NaN
11  57.104578   -2.235641   18500   327.0   56.574924
12  57.105424   -2.234953   21950   406.0   54.064039
13  57.105516   -2.233683   19600   408.0   48.039216

1 Answer:

Answer 0 (score: 1)

If you are open to using a library, you can use a distance matrix.

Assuming your main DataFrame is df:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
import pandas as pd

def find_closest(x, df):
    # x is one row of the distance matrix; drop the row's distance to itself
    d = x.drop(x.name).to_dict()
    # Sort the remaining row labels by distance, nearest first
    v = sorted(d, key=lambda k: d[k])
    # Return the ratio of the nearest row with a known area, else NaN
    known = df.index[df.area.notnull()]
    for i in v:
        if i in known:
            return df.loc[i].ratio
    return np.nan

# Pairwise Euclidean distances on raw lat/long (assumes df has a default 0..n-1 index)
df_matrix_distance = pd.DataFrame(euclidean_distances(df[["lat","long"]]))
# Rows whose area is missing
df_nan = df[df.area.isnull()]
# Nearest known ratio for each of those rows
res = df_matrix_distance.loc[df_nan.index].apply(lambda x: find_closest(x,df), axis=1).to_dict()
# Fill in the ratio, then derive the area as value / ratio
for k,v in res.items():
    df.loc[k,"ratio"] = v
    df.loc[k,"area"] = df.loc[k,"value"]/ df.loc[k,"ratio"]

Result

    lat         long        value   area    ratio
0   57.101474   -2.242851   12850   252.0   50.992063
1   57.102554   -2.246308   14700   309.0   47.572816
2   57.100556   -2.248342   25600   507.0   50.493097
3   57.101765   -2.254688   28000   491.0   57.026477
4   57.097553   -2.245483   5650    119.0   47.478992
5   57.098244   -2.245768   43000   811.0   53.020962
6   57.098554   -2.252504   46300   850.0   54.470588
7   57.102794   -2.243454   7850    180.0   43.611111
8   57.101474   -2.242851   26250   514.0   50.99206349
9   57.101893   -2.239883   31000   607.0   51.00502513
10  57.101383   -2.238955   28750   563.0   51.00502513
11  57.104578   -2.235641   18500   327.0   56.574924
12  57.105424   -2.234953   21950   406.0   54.064039
13  57.105516   -2.233683   19600   408.0   48.039216
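A full pairwise distance matrix grows with the square of the number of rows, so for a dataset of around 2 million rows it would have on the order of 4×10^12 entries. One possible alternative, sketched below, is the cKDTree approach mentioned in the question: build a tree from the rows with a known area only, then query it once for all rows with a missing area. This is a hedged sketch rather than part of the answer above; the helper names `to_unit_sphere` and `fill_from_nearest` are hypothetical, and the latitude/longitude pairs are converted to 3-D unit vectors on the assumption that a spherical Earth is accurate enough (nearest by straight-line chord is then also nearest by great-circle distance).

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def to_unit_sphere(lat_deg, lon_deg):
    """Map lat/long in degrees to 3-D unit vectors (spherical Earth assumed)."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))


def fill_from_nearest(df):
    """Return a copy of df where rows with a missing area take the ratio of
    the nearest row with a known area, and area is derived as value / ratio."""
    out = df.copy()
    known = out["area"].notna().to_numpy()
    if known.all() or not known.any():
        return out  # nothing to fill, or nothing to fill from
    # Index only the rows that can act as donors
    tree = cKDTree(to_unit_sphere(out.loc[known, "lat"], out.loc[known, "long"]))
    # Position (within the donor rows) of the nearest donor for each missing row
    _, nearest = tree.query(
        to_unit_sphere(out.loc[~known, "lat"], out.loc[~known, "long"]), k=1)
    ratios = out.loc[known, "ratio"].to_numpy()[nearest]
    out.loc[~known, "ratio"] = ratios
    out.loc[~known, "area"] = out.loc[~known, "value"].to_numpy() / ratios
    return out


# Four rows taken from the test data in the question
df = pd.DataFrame({
    "lat": [57.101474, 57.102554, 57.101474, 57.101893],
    "long": [-2.242851, -2.246308, -2.242851, -2.239883],
    "value": [12850, 14700, 26250, 31000],
    "area": [252.0, 309.0, np.nan, np.nan],
})
df["ratio"] = df["value"] / df["area"]
filled = fill_from_nearest(df)
```

Passing a larger `k` to `tree.query` and averaging the returned ratios would be one way to smooth the estimate over several neighbours, and running the function separately per property use (office, factory, shop) would address the grouping idea from the question; both are left out here to keep the sketch short.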