What I have:
|ids     |items   |item_id|value|timestamp|
+--------+--------+-------+-----+---------+
|[A,B,C] |1.0     |1      |5    |100      |
|[A,B,D] |1.0     |2      |6    |90       |
|[D]     |0.0     |3      |7    |80       |
|[C]     |0.0     |4      |8    |80       |
+--------+--------+-------+-----+---------+
|ids     |id_num  |
+--------+--------+
|A       |1       |
|B       |2       |
|C       |3       |
|D       |4       |
+--------+--------+
What I want:
|ids     |
+--------+
|[1,2,3] |
|[1,2,4] |
|[3]     |
|[4]     |
+--------+
Is there a way to do this without using explode? Thanks for your help!
Answer 0: (score: 0)
You can use a UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

# Suppose this is the dictionary you want to map
map_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

def array_map(array_col):
    # If you prefer a list comprehension, return [map_dict[k] for k in array_col]
    return list(map(map_dict.get, array_col))

# ArrayType requires an element type; the mapped values here are integers
array_map_udf = udf(array_map, ArrayType(IntegerType()))
df = df.withColumn("mapped_array", array_map_udf(col("ids")))
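To see what the UDF body does to each array, the mapping can be tried in plain Python without a Spark session. This is only a sketch: the names array_map_get, array_map_strict and rows are made up for illustration, and rows simply copies the ids column from the question. It also shows that the two variants mentioned above behave differently for an id that is missing from the dictionary.

```python
# Same dictionary as in the answer above.
map_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

def array_map_get(array_col):
    # dict.get returns None when a key is not in the dictionary
    return list(map(map_dict.get, array_col))

def array_map_strict(array_col):
    # Indexing raises KeyError when a key is not in the dictionary
    return [map_dict[k] for k in array_col]

# Illustrative copy of the ids column from the question
rows = [['A', 'B', 'C'], ['A', 'B', 'D'], ['D'], ['C']]
mapped = [array_map_get(r) for r in rows]
print(mapped)  # [[1, 2, 3], [1, 2, 4], [4], [3]]
```

With the lookup table as given, [D] maps to [4] and [C] to [3]. Inside a UDF, the dict.get variant would leave a null in the array for an unknown id, while the list comprehension would make the task fail, so the choice depends on how missing ids should be treated.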
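I can't think of another way, but to build the dictionary in parallel you can use the toJSON method. It will require some further processing of the reference df you have: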
import json
df_json = df.toJSON().map(lambda x: json.loads(x))
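The further processing could look roughly like the following. It is a sketch under assumptions: json_rows is a hypothetical stand-in for the strings that toJSON would produce for the reference table (in Spark you would take the parsed records from df_json.collect() instead), and the column names ids and id_num are taken from the question.

```python
import json

# Hypothetical stand-in for the JSON strings produced by toJSON()
# on the reference table; one string per row.
json_rows = [
    '{"ids": "A", "id_num": 1}',
    '{"ids": "B", "id_num": 2}',
    '{"ids": "C", "id_num": 3}',
    '{"ids": "D", "id_num": 4}',
]

# Parse each row, as the lambda in the snippet above does
records = [json.loads(s) for s in json_rows]

# Collapse the parsed rows into a single lookup dictionary
map_dict = {r["ids"]: r["id_num"] for r in records}
print(map_dict)  # {'A': 1, 'B': 2, 'C': 3, 'D': 4}
```

Collecting the records to the driver like this is only reasonable when the reference table is small enough to fit in memory.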