How to add a dictionary as a column to a DataFrame in PySpark

Time: 2019-08-01 20:54:41

Tags: python dataframe pyspark user-defined-functions

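I am trying to add a new column to a DataFrame, based on the result of a function that builds a sorted dictionary, but I cannot get it to work.

I am using Python 3.6 and running it in PyCharm with a local Spark session. I tried declaring the UDF return type as ArrayType, but that does not solve it: the output column is null.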

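import collections

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType


def getDictionary(data):
    # Pair each hour label with its score from the probability column.
    hour = ['8', '9', '10']
    score = [data[0], data[1], data[2]]
    res = dict(zip(hour, score))

    # Sort by score, highest first.
    sorted_x = sorted(res.items(), key=lambda kv: kv[1], reverse=True)
    sorted_dict = collections.OrderedDict(sorted_x)

    # Keep the three highest-scoring pairs as a plain dict.
    first3pairs = {k: sorted_dict[k] for k in list(sorted_dict)[:3]}
    return first3pairs


get_res_udf = F.udf(getDictionary, ArrayType(StringType()))

data = data.withColumn('result', get_res_udf(data['probability']))
data.show(10, False)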

Actual output:

+----------+-------------------------+------+
|loannumber|scoring_ts_utc           |result|
+----------+-------------------------+------+
|  11111111|2019-08-01 19:33:18.98721|null  |
+----------+-------------------------+------+

Expected:

+----------+-------------------------+----------------------------------------------------------------------------------------------------------+
|loannumber|scoring_ts_utc           |result                                                                                                    |
+----------+-------------------------+----------------------------------------------------------------------------------------------------------+
|  11111111|2019-08-01 19:33:18.98721|{'8': 0.15553969938314824, '10': 0.1135606782079484, '12': 0.10158022312738095, '14': 0.08433517313467825}|
+----------+-------------------------+----------------------------------------------------------------------------------------------------------+
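A likely cause of the null column is that the UDF is declared as returning ArrayType(StringType()) while the function actually returns a dict; when the returned value does not match the declared type, PySpark tends to produce null rather than raise an error. The sketch below is one possible way around this, not a confirmed fix: it declares the return type as MapType(StringType(), DoubleType()) instead. The names `HOURS`, `top_scores`, and `add_result_column` are hypothetical, and it assumes `probability` is an indexable sequence of numeric scores (for example an ML vector), so each score is passed through `float()` to turn any NumPy scalar into a plain Python float.

```python
# Hour labels taken from the question; adjust to match the real data.
HOURS = ['8', '9', '10']


def top_scores(probability, n=3):
    """Return the n highest-scoring hours as a plain {hour: score} dict."""
    # float() converts NumPy scalars into Python floats, which is what a
    # DoubleType map value expects (assumption: scores are numeric).
    pairs = [(hour, float(probability[i])) for i, hour in enumerate(HOURS)]
    pairs.sort(key=lambda kv: kv[1], reverse=True)
    return dict(pairs[:n])


def add_result_column(df, source_col='probability', target_col='result'):
    """Add a map-typed column built from top_scores (hypothetical helper)."""
    # PySpark is imported here so top_scores can be tried without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType, MapType, StringType

    # The declared return type now matches the dict that top_scores returns.
    top_scores_udf = F.udf(top_scores, MapType(StringType(), DoubleType()))
    return df.withColumn(target_col, top_scores_udf(df[source_col]))
```

With the DataFrame from the question this would be called as `data = add_result_column(data)` followed by `data.show(10, False)`. A map column is not defined as ordered, so if the ranking itself matters, an array of (hour, score) structs may be a safer representation.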

0 answers:

No answers