Question

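I'm new to Python and would like to know the best way to determine, for each record in DF1, the row in DF2 that minimizes the value of a function whose arguments involve both DFs.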

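DF1 has several hundred thousand records with columns lat1 and lon1, and DF2 has 50,000 records with columns lat2, lon2 and zip. I want to apply a function f(lat1, lon1, lat2, lon2) that computes the distance between two points (defined by lat1, lon1, lat2, lon2). Ultimately, for each row in DF1, I want to add the zip from the DF2 record that has the minimum distance between that DF1 row and all rows in DF2.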

Answer 1

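If you need to make 5 billion calculations, you'll want it to be fast. I generated 2 random datasets: df1 with lat and lon columns, and df2 with lon, lat and zip columns. df1 has 10,000 rows and df2 has 50,000. For the 10,000 rows in df1 it takes about 18 seconds to run (I have 8 cores), or 0.001805 seconds per record. So it will take you about 3 minutes (or a little worse) to get through 100,000 records.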

%%file lat_long.py

import pandas as pd
import numpy as np
from multiprocessing import Pool

###############  Generate random data  ##################
d1 = np.random.randn(20000).reshape((10000, 2))
d2 = np.random.randn(50000*3).reshape((50000, 3))

# Module-level frames; the worker processes read these directly
df1 = pd.DataFrame(d1, columns = ['lat1', 'lon1'])
df2 = pd.DataFrame(d2, columns = ['lat2', 'lon2', 'zip'])
#########################################################

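def min_gen(a1, a2, n):
    # Euclidean distance from row n of a1 to every row of a2
    A = a1.lat1[n] - a2.lat2
    A = A*A
    B = a1.lon1[n] - a2.lon2
    B = B*B
    C = np.sqrt(A + B)
    # Pair each distance with the positional index of its row in a2
    tmp = np.arange(len(a2)).reshape((len(a2), 1))
    D = np.c_[C, tmp]
    return D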

def main(i):
    min_arr = min_gen(df1, df2, i)
    # (row in df1, minimum distance, row in df2 where the minimum occurs)
    return i, min_arr[:,0].min(), min_arr[:,0].argmin()

if __name__ == '__main__':
    # Pool() starts one worker per available core by default
    p = Pool()
    r = p.map(main, range(len(df1)))
    print(r)

# <next cell>
%%bash
python lat_long.py

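The program loops in parallel, computing the distances and their minimum. Printing r outputs a list of tuples containing the row # from df1, the minimum distance, and the row # in df2 corresponding to the minimum (so you can look up the zip code). I'll leave it to you to collect the zip codes and arrange the dataset.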
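Collecting the zip codes from r could look roughly like the sketch below. It assumes r is the list of (df1 row, min distance, df2 row) tuples described above and that both frames keep their default 0..n-1 index. The toy frames and the names nearest and min_dist are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the real frames (illustrative values only)
df1 = pd.DataFrame({'lat1': [0.0, 5.0], 'lon1': [0.0, 5.0]})
df2 = pd.DataFrame({'lat2': [0.1, 4.9, 9.0],
                    'lon2': [0.1, 5.2, 9.0],
                    'zip': ['10001', '20002', '30003']})

# Same shape as the script's result: (df1 row, min distance, df2 row)
r = [(0, 0.141, 0), (1, 0.224, 1)]

# One row per df1 record, indexed by the df1 row number
nearest = pd.DataFrame(r, columns=['row1', 'min_dist', 'row2']).set_index('row1')

# Look up each zip by position in df2, then align it to df1 via the row number
zips = df2['zip'].to_numpy()[nearest['row2'].to_numpy()]
df1['min_dist'] = nearest['min_dist']
df1['zip'] = pd.Series(zips, index=nearest.index)
```

Aligning on the df1 row number rather than on list order means the result stays correct even if the tuples in r were shuffled.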

Answer 2

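The following code should work. The list comprehension below iterates over every item in the second frame for each row in the first frame. Each value and its index are stored in a tuple. The minimum is found using a lambda that selects the first element. The index is then extracted by mapping a different lambda that selects only the second element. Here is a good explanation of lambdas: http://www.secnetix.de/olli/Python/lambda_functions.hawk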

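ldf1 = len(df1)
ldf2 = len(df2)
# f is the distance function from the question; .loc[j, ...] assumes the default 0..n-1 index
funk = lambda df1, df2, j, i: f(df1.loc[j, 'lat1'], df1.loc[j, 'lon1'], df2.loc[i, 'lat2'], df2.loc[i, 'lon2'])
# For each row j of df1, keep the (distance, i) tuple with the smallest distance
pairs = [min([(funk(df1, df2, j, i), i) for i in range(ldf2)], key=lambda x: x[0]) for j in range(ldf1)]
# Second element of each tuple is the df2 row index of the nearest point
mins = list(map(lambda x: x[1], pairs))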

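It's worth noting that this runs in polynomial time, O(n*m) for n rows in df1 and m rows in df2, which will take a while with the number of rows you have. I chose to use map and list comprehensions because they are usually somewhat faster than a standard for-each loop.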
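To see the tuple/min/lambda pattern from this answer on a small input, here is a sketch with toy data. The question does not say what f is, so the haversine great-circle formula below is only an assumed stand-in, as are the toy coordinates, the zip values and the 6371 km Earth radius.

```python
import math
import pandas as pd

def f(lat1, lon1, lat2, lon2):
    """Assumed stand-in for the question's f: haversine distance in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# Toy frames (illustrative values only)
df1 = pd.DataFrame({'lat1': [40.7, 34.0], 'lon1': [-74.0, -118.2]})
df2 = pd.DataFrame({'lat2': [34.1, 41.9, 40.8],
                    'lon2': [-118.3, -87.6, -73.9],
                    'zip': ['90001', '60601', '10001']})

# For each df1 row j, build (distance, i) tuples over all df2 rows i
# and keep the tuple with the smallest distance
pairs = [min(((f(df1.loc[j, 'lat1'], df1.loc[j, 'lon1'],
                 df2.loc[i, 'lat2'], df2.loc[i, 'lon2']), i)
              for i in range(len(df2))), key=lambda x: x[0])
         for j in range(len(df1))]

# The second tuple element is the df2 row index of the nearest point
mins = [p[1] for p in pairs]
df1['zip'] = df2.loc[mins, 'zip'].to_numpy()
```

Here the first df1 row ends up matched to df2 row 2 and the second to df2 row 0, so the zip column can be filled in directly from mins.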

Finding MinArg in Python - Distance between Pandas DFs
