GroupBy operation in PySpark

Time: 2019-09-16 12:02:24

Tags: pyspark pyspark-sql pyspark-dataframes

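I have a DataFrame in which I have computed the haversine distance from latitude and longitude. For each id, I want to find the minimum distance together with the corresponding store_no.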
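For context, the question does not show how dist_km was computed. A minimal sketch of the haversine formula in plain Python follows; the function name haversine_km and the mean Earth radius of 6371 km are assumptions, so results may differ slightly from the values in the table if a different radius was used.

```python
import math

# Assumed mean Earth radius in kilometres; the question does not state which value was used.
EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine term: a = sin^2(dphi/2) + cos(phi1) * cos(phi2) * sin^2(dlambda/2)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Distance from user 1 to store 22, using the coordinates from the table.
print(haversine_km(13.031885, 80.235574, 29.91, 73.88))
```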

The DataFrame looks like this:

+---+---------+---------+-----+-----+--------+---------+
| id| user_lat| user_lon|s_lat|s_lon|store_no|  dist_km|
+---+---------+---------+-----+-----+--------+---------+
|  1|13.031885|80.235574|29.91|73.88|      22| 1988.047|
|  1|13.031885|80.235574|28.57|77.33|      23| 1754.225|
|  1|13.031885|80.235574|26.86|80.95|      24|1539.8511|
|  2|19.099819|72.915288|29.91|73.88|      22|1206.3154|
|  3| 22.22698| 84.83607|29.91|73.88|      22|1387.3323|
|  2|19.099819|72.915288|28.57|77.33|      23|1144.7731|
|  2|19.099819|72.915288|26.86|80.95|      24|1191.7048|
|  3| 22.22698| 84.83607|28.57|77.33|      23|1032.1859|
|  3| 22.22698| 84.83607|26.86|80.95|      24| 648.0673|
+---+---------+---------+-----+-----+--------+---------+

I want my final DataFrame to be:

+---+---------+---------+-----+-----+--------+---------+
| id| user_lat| user_lon|s_lat|s_lon|store_no|  dist_km|
+---+---------+---------+-----+-----+--------+---------+
|  1|13.031885|80.235574|26.86|80.95|      24|1539.8511|
|  2|19.099819|72.915288|28.57|77.33|      23|1144.7731|
|  3| 22.22698| 84.83607|26.86|80.95|      24| 648.0673|
+---+---------+---------+-----+-----+--------+---------+

1 answer:

Answer 0 (score: 0)

This should work for you:

Create nearest_store_df with the minimum distance for each id:

import pyspark.sql.functions as psf

nearest_store_df = df\
    .groupBy('id')\
    .agg(psf.min('dist_km').alias('min_dist_km'))

Now join nearest_store_df back to the original DataFrame.

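# Join on id so one user's minimum cannot match another user's row, then keep the closest store.
df\
    .join(nearest_store_df, on='id', how='inner')\
    .where(psf.col('dist_km') == psf.col('min_dist_km'))\
    .drop('min_dist_km')\
    .show()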