Mapping values of a column of ArrayType based on values from another DataFrame in PySpark

Time: 2019-10-29 00:44:44

Tags: pyspark

What I have:

+--------+-----+-------+-----+---------+
|ids     |items|item_id|value|timestamp|
+--------+-----+-------+-----+---------+
|[A,B,C] |1.0  |1      |5    |100      |
|[A,B,D] |1.0  |2      |6    |90       |
|[D]     |0.0  |3      |7    |80       |
|[C]     |0.0  |4      |8    |80       |
+--------+-----+-------+-----+---------+

+---+------+
|ids|id_num|
+---+------+
|A  |1     |
|B  |2     |
|C  |3     |
|D  |4     |
+---+------+
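
For reference, a minimal sketch to reproduce the two DataFrames above (the variable names df, ref_df, and spark are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Main DataFrame with the ArrayType column `ids`
df = spark.createDataFrame(
    [(["A", "B", "C"], 1.0, 1, 5, 100),
     (["A", "B", "D"], 1.0, 2, 6, 90),
     (["D"], 0.0, 3, 7, 80),
     (["C"], 0.0, 4, 8, 80)],
    ["ids", "items", "item_id", "value", "timestamp"],
)

# Reference DataFrame mapping each letter id to its numeric id
ref_df = spark.createDataFrame(
    [("A", 1), ("B", 2), ("C", 3), ("D", 4)],
    ["ids", "id_num"],
)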

What I want:

| ids    |
+--------+
|[1,2,3] |      
|[1,2,4] |    
|[3]     | 
|[4]     | 
+--------+

Is there a way to do this without using explode? Thanks for your help!

1 answer:

Answer 0: (score: 0)

You can use a UDF:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

# Suppose this is the dictionary you want to map
map_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

def array_map(array_col):
    return list(map(map_dict.get, array_col))
# If you prefer a list comprehension, you can instead
# return [map_dict[k] for k in array_col]

# ArrayType requires an element type; the mapped values here are integers
array_map_udf = udf(array_map, ArrayType(IntegerType()))

df = df.withColumn("mapped_array", array_map_udf(col("ids")))
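
Applied to the example data, a quick sanity check (column names taken from the question):

# Each letter is replaced by its id_num, preserving order,
# e.g. [A,B,C] -> [1, 2, 3]
df.select("ids", "mapped_array").show()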

I can't think of another approach, but to build the mapping dictionary from your data, you can use the toJSON method. It will require some further processing of the reference df you have:

import json

# Convert each row of the reference df into a Python dict (one per row)
df_json = df.toJSON().map(lambda x: json.loads(x))
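
From there, a minimal sketch to finish the job, assuming the reference df (with the columns ids and id_num shown above) is small enough to collect to the driver:

# Fold the collected {'ids': ..., 'id_num': ...} row dicts into the
# plain Python dict expected by array_map above
map_dict = {row['ids']: row['id_num'] for row in df_json.collect()}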