我有一个数据框,已根据纬度和经度计算了正弦距离。我想找到最小距离,id和store_code。
数据框看起来像-
+---+---------+---------+-----+-----+--------+---------+
| id| user_lat| user_lon|s_lat|s_lon|store_no| dist_km|
+---+---------+---------+-----+-----+--------+---------+
| 1|13.031885|80.235574|29.91|73.88| 22| 1988.047|
| 1|13.031885|80.235574|28.57|77.33| 23| 1754.225|
| 1|13.031885|80.235574|26.86|80.95| 24|1539.8511|
| 2|19.099819|72.915288|29.91|73.88| 22|1206.3154|
| 3| 22.22698| 84.83607|29.91|73.88| 22|1387.3323|
| 2|19.099819|72.915288|28.57|77.33| 23|1144.7731|
| 2|19.099819|72.915288|26.86|80.95| 24|1191.7048|
| 3| 22.22698| 84.83607|28.57|77.33| 23|1032.1859|
| 3| 22.22698| 84.83607|26.86|80.95| 24| 648.0673|
+---+---------+---------+-----+-----+--------+---------+
I want my final df should be -
+---+---------+---------+-----+-----+--------+---------+
| id| user_lat| user_lon|s_lat|s_lon|store_no| dist_km|
+---+---------+---------+-----+-----+--------+---------+
| 1|13.031885|80.235574|26.86|80.95| 24|1539.8511|
| 2|19.099819|72.915288|28.57|77.33| 23|1144.7731|
| 3| 22.22698| 84.83607|26.86|80.95| 24| 648.0673|
+---+---------+---------+-----+-----+--------+---------+
答案 0 :(得分:0)
这应该对您有用-
以最小的距离创建nearest_store_df
。
import pyspark.sql.functions as psf
nearest_store_df = df\
.groupBy('id')\
.agg(psf.min('dist_km').alias('min_dist_km'))
现在将nearest_store_df
与原始数据框一起加入。
df\
.join(nearest_store_df, df.dist_km == nearest_store_df.min_dist_km,'inner')\
.show()