Returning a list of dictionaries from a PySpark UDF

Date: 2021-03-19 20:29:44

Tags: python arrays apache-spark dictionary pyspark

I have a list of dictionaries that looks like this:

department_amount_pairs = [{"department_1": 100}, {"department_2": 200}, {"department_1": 300}]

What I currently do is:

import json

def department_udf(department_amount_pairs):
    pairs = []
    for d in department_amount_pairs:
        pairs.append(json.dumps(d))  # serialize each dict to a JSON string
    return pairs

This is my UDF definition:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

extractor = udf(department_udf, ArrayType(StringType()))
spark.udf.register("extractor_udf", extractor)

And this is how I call the function:

data = data.withColumn('pairs', extractor_udf('department_amount'))

It returns the data as JSON strings, e.g. "[{"department_1": 100},{"department_2": 200},{"department_1": 300}]", and I have to call json.loads() to get the array back. Instead, I want my UDF to return an array of dictionaries directly.

I tried appending the dictionaries to the list without json.dumps, but then I get NULL values. I also tried changing the return type to ArrayType(ArrayType()), which did not work either.

1 Answer:

Answer 0 (score: 0)

You can return an array of dictionaries by declaring the UDF's return type as array<map<string,int>>.

For example:

from pyspark.sql.functions import udf

def department_udf():
    return [{"department_1": 100}, {"department_2": 200}, {"department_1": 300}]

extractor = udf(department_udf, 'array<map<string,int>>')

df = spark.range(1)

df.withColumn('pairs', extractor()).show(truncate=False)
+---+---------------------------------------------------------------------+
|id |pairs                                                                |
+---+---------------------------------------------------------------------+
|0  |[[department_1 -> 100], [department_2 -> 200], [department_1 -> 300]]|
+---+---------------------------------------------------------------------+