I have a list of dictionaries that looks like this:
department_amount_pairs = [{"department_1": 100},{"department_2": 200},{"department_1": 300}]
What I currently do is:

import json

def department_udf(department_amount_pairs):
    pair = []
    for d in department_amount_pairs:
        pair.append(json.dumps(d))
    return pair
This is my UDF definition:

extractor = udf(department_udf, ArrayType(StringType()))
spark.udf.register("extractor_udf", extractor)
And this is how I call the function:

data = data.withColumn('pairs', extractor_udf('department_amount'))
It returns JSON strings, e.g. ["{\"department_1\": 100}", "{\"department_2\": 200}", "{\"department_1\": 300}"], and I have to call json.loads() to get the dictionaries back. But I want my UDF to return an array of dictionaries directly.

I tried dropping json.dumps and appending the dictionaries to the list as-is, but I get NONE values. I also tried changing the return type to ArrayType(ArrayType()), and that did not work either.
Answer 0 (score: 0)
You can return an array of dictionaries by specifying the UDF return type as array<map<string,int>>.
For example:

from pyspark.sql.functions import udf

def department_udf():
    return [{"department_1": 100}, {"department_2": 200}, {"department_1": 300}]

extractor = udf(department_udf, 'array<map<string,int>>')

df = spark.range(1)
df.withColumn('pairs', extractor()).show(truncate=False)
+---+---------------------------------------------------------------------+
|id |pairs |
+---+---------------------------------------------------------------------+
|0 |[[department_1 -> 100], [department_2 -> 200], [department_1 -> 300]]|
+---+---------------------------------------------------------------------+