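I have a dataset of about 2 million rows describing properties at specific latitudes and longitudes. For each property I have a valuation and a floor area. Every property has a valuation, but not all of them have a floor area.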
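I would like to use some nearest-neighbour method to interpolate, and so approximate, the missing (NaN) values in the table. My software is written in Python, so the solution will probably need NumPy, pandas, SciPy, or some combination of them.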
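I have looked at SciPy's cKDTree, and at distance approximations using the Haversine formula, but all the examples I have seen are about interpolating over a plane rather than filling gaps in missing data, and I am at a loss as to how to implement this.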
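For reference, the Haversine distance mentioned above can be written in vectorised NumPy roughly as follows. This is only a minimal sketch: the function name haversine_km and the mean Earth radius of 6371 km are assumptions chosen for illustration, not taken from any particular library.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees.

    Accepts scalars or NumPy arrays (broadcast against each other).
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine term: squared half-chord length between the two points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
```

Calling it on two rows of the table, e.g. haversine_km(57.101474, -2.242851, 57.102554, -2.246308), gives the separation between those two properties in kilometres.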
As an example, here are the first few rows I am using as test data (ratio is simply value/area):
lat | long | value | area | ratio
----------|-----------|-------|-------|----------
57.101474 | -2.242851 | 12850 | 252.0 | 50.992063
57.102554 | -2.246308 | 14700 | 309.0 | 47.572816
57.100556 | -2.248342 | 25600 | 507.0 | 50.493097
57.101765 | -2.254688 | 28000 | 491.0 | 57.026477
57.097553 | -2.245483 | 5650 | 119.0 | 47.478992
57.098244 | -2.245768 | 43000 | 811.0 | 53.020962
57.098554 | -2.252504 | 46300 | 850.0 | 54.470588
57.102794 | -2.243454 | 7850 | 180.0 | 43.611111
57.101474 | -2.242851 | 26250 | NaN | NaN
57.101893 | -2.239883 | 31000 | NaN | NaN
57.101383 | -2.238955 | 28750 | NaN | NaN
57.104578 | -2.235641 | 18500 | 327.0 | 56.574924
57.105424 | -2.234953 | 21950 | 406.0 | 54.064039
57.105516 | -2.233683 | 19600 | 408.0 | 48.039216
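To make the sample reproducible, the table above can be rebuilt as a pandas DataFrame along these lines. The variable name df matches the one used elsewhere in this thread, and ratio is recomputed from value and area rather than typed in, so it is NaN wherever area is missing.

```python
import numpy as np
import pandas as pd

# Test rows copied from the table above; NaN marks a missing floor area
df = pd.DataFrame(
    {
        "lat": [57.101474, 57.102554, 57.100556, 57.101765, 57.097553,
                57.098244, 57.098554, 57.102794, 57.101474, 57.101893,
                57.101383, 57.104578, 57.105424, 57.105516],
        "long": [-2.242851, -2.246308, -2.248342, -2.254688, -2.245483,
                 -2.245768, -2.252504, -2.243454, -2.242851, -2.239883,
                 -2.238955, -2.235641, -2.234953, -2.233683],
        "value": [12850, 14700, 25600, 28000, 5650, 43000, 46300,
                  7850, 26250, 31000, 28750, 18500, 21950, 19600],
        "area": [252.0, 309.0, 507.0, 491.0, 119.0, 811.0, 850.0,
                 180.0, np.nan, np.nan, np.nan, 327.0, 406.0, 408.0],
    }
)

# ratio is value per unit of floor area; NaN propagates from missing areas
df["ratio"] = df["value"] / df["area"]
```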
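The properties themselves could be grouped further to get a better relationship (this is not part of the test data, but each property is used for a different purpose, e.g. office, factory, shop).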
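I realise I could loop through slowly, grouping properties by how far apart they are (testing each NaN property against all the others), but that seems like it would be heartbreakingly glacial.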
Output of df.to_clipboard() (first 15 rows):
lat long value area ratio
0 57.101474 -2.242851 12850 252.0 50.992063
1 57.102554 -2.246308 14700 309.0 47.572816
2 57.100556 -2.248342 25600 507.0 50.493097
3 57.101765 -2.254688 28000 491.0 57.026477
4 57.097553 -2.245483 5650 119.0 47.478992
5 57.098244 -2.245768 43000 811.0 53.020962
6 57.098554 -2.252504 46300 850.0 54.470588
7 57.102794 -2.243454 7850 180.0 43.611111
8 57.101474 -2.242851 26250 NaN NaN
9 57.101893 -2.239883 31000 NaN NaN
10 57.101383 -2.238955 28750 NaN NaN
11 57.104578 -2.235641 18500 327.0 56.574924
12 57.105424 -2.234953 21950 406.0 54.064039
13 57.105516 -2.233683 19600 408.0 48.039216
Answer 0 (score: 1)
If you are open to using a library, you can build a distance matrix with scikit-learn.
Assuming your main dataframe is df:
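import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

def find_closest(x, df, known_index):
    """Return the ratio of the nearest row that has a known area, else NaN."""
    # Drop the row's distance to itself, then order the others by distance
    candidates = x.drop(x.name).sort_values(kind="stable").index
    # Take the closest row with a non-NaN area
    for i in candidates:
        if i in known_index:
            return df.loc[i, "ratio"]
    return np.nan

# Pairwise Euclidean distances on raw lat/long (an n x n matrix, so this is
# only practical for small frames); reuse df's index so the labels line up
df_matrix_distance = pd.DataFrame(
    euclidean_distances(df[["lat", "long"]]), index=df.index, columns=df.index
)
# Rows with a missing area, and the labels of rows with a known area
df_nan = df[df.area.isnull()]
known_index = df.index[df.area.notnull()]
# Nearest known ratio for each row with a missing area
res = df_matrix_distance.loc[df_nan.index].apply(
    lambda x: find_closest(x, df, known_index), axis=1
).to_dict()
# Fill in the ratio, then derive the area from value / ratio
for k, v in res.items():
    df.loc[k, "ratio"] = v
    df.loc[k, "area"] = df.loc[k, "value"] / df.loc[k, "ratio"]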
Result
lat long value area ratio
0 57.101474 -2.242851 12850 252.0 50.992063
1 57.102554 -2.246308 14700 309.0 47.572816
2 57.100556 -2.248342 25600 507.0 50.493097
3 57.101765 -2.254688 28000 491.0 57.026477
4 57.097553 -2.245483 5650 119.0 47.478992
5 57.098244 -2.245768 43000 811.0 53.020962
6 57.098554 -2.252504 46300 850.0 54.470588
7 57.102794 -2.243454 7850 180.0 43.611111
8 57.101474 -2.242851 26250 514.0 50.99206349
9 57.101893 -2.239883 31000 607.0 51.00502513
10 57.101383 -2.238955 28750 563.0 51.00502513
11 57.104578 -2.235641 18500 327.0 56.574924
12 57.105424 -2.234953 21950 406.0 54.064039
13 57.105516 -2.233683 19600 408.0 48.039216
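A full distance matrix has one entry per pair of rows, so it will not fit in memory at roughly 2 million rows. As a rough sketch of an alternative (not part of the original answer), a scipy.spatial.cKDTree can be built on only the rows with a known area and queried for the rows that lack one. The names to_unit_sphere, fill_from_neighbours, k and eps are hypothetical. Points are first mapped onto the unit sphere, where straight-line (chord) distance grows monotonically with great-circle distance, so the neighbour ordering matches the Haversine ordering without treating degrees as planar coordinates. With k=1 the nearest known ratio is copied; with larger k the ratios are averaged with inverse-distance weights.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def to_unit_sphere(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to 3-D points on the unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack(
        (np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat))
    )


def fill_from_neighbours(df, k=1, eps=1e-12):
    """Return a copy of df with missing ratio/area estimated from k neighbours.

    Assumes columns lat, long, value, area, ratio, and that ratio is known
    wherever area is known.
    """
    out = df.copy()
    known = out[out["area"].notna()]
    missing = out[out["area"].isna()]
    if known.empty or missing.empty:
        return out
    k = min(k, len(known))  # cannot ask for more neighbours than exist
    tree = cKDTree(to_unit_sphere(known["lat"], known["long"]))
    dist, pos = tree.query(to_unit_sphere(missing["lat"], missing["long"]), k=k)
    # query() returns 1-D arrays when k == 1, so force shape (n_missing, k)
    dist = dist.reshape(len(missing), k)
    pos = pos.reshape(len(missing), k)
    # Inverse-distance weights; eps avoids division by zero for identical points
    weights = 1.0 / (dist + eps)
    ratios = known["ratio"].to_numpy()[pos]
    estimate = (weights * ratios).sum(axis=1) / weights.sum(axis=1)
    out.loc[missing.index, "ratio"] = estimate
    out.loc[missing.index, "area"] = missing["value"].to_numpy() / estimate
    return out
```

Usage would be something like filled = fill_from_neighbours(df, k=3). If properties should only borrow from others of the same type, the same function could be applied to each group separately, assuming such a grouping column exists in the real data.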