Using PySpark, what is the best way to reduce a map column to the single entry with the smallest key in each row?
In the example below, I only want to keep the action that happened first (the one with the smallest year key):
Sample dataframe:
+------+-----------------------+
| Name | Actions |
+------+-----------------------+
|Alice |{1978:'aaa',1981:'bbb'}|
|Jack |{1999:'xxx',1988:'yyy'}|
|Bill |{1992:'zzz'} |
+------+-----------------------+
Desired DF:
+------+----------------------+
| Name | Actions |
+------+----------------------+
|Alice |{1978:'aaa'} |
|Jack |{1988:'yyy'} |
|Bill |{1992:'zzz'} |
+------+----------------------+
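For reference, the per-row reduction being asked for is equivalent to this plain-Python sketch (the names `min_action` and `rows` are illustrative, not from the original post):

```python
def min_action(actions):
    # Keep only the entry with the smallest key (the earliest year).
    year = min(actions)
    return {year: actions[year]}

rows = {
    "Alice": {1978: 'aaa', 1981: 'bbb'},
    "Jack": {1999: 'xxx', 1988: 'yyy'},
    "Bill": {1992: 'zzz'},
}
result = {name: min_action(actions) for name, actions in rows.items()}
print(result)
# {'Alice': {1978: 'aaa'}, 'Jack': {1988: 'yyy'}, 'Bill': {1992: 'zzz'}}
```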
Answer 0 (score: 1)
Convert the map to arrays with map_keys and map_values:
from pyspark.sql.functions import *

df = spark.createDataFrame([("Name", {1978: 'aaa', 1981: 'bbb'})], ("Name", "Actions"))
df_array = df.select(
    "Name",
    map_keys("Actions").alias("keys"),
    map_values("Actions").alias("values")
)
Zip the two arrays into an array of (key, value) pairs with arrays_zip and sort it by key with array_sort:
df_array_sorted = df_array.withColumn("sorted", array_sort(arrays_zip("keys", "values")))
Take the first element, then convert back to a map with map_from_entries:
df_array_sorted.select("Name", map_from_entries(array(col("sorted")[0])).alias("Actions")).show()
# +----+-------------+
# |Name|      Actions|
# +----+-------------+
# |Name|[1978 -> aaa]|
# +----+-------------+