pyspark嵌套数组爆炸

时间:2018-07-13 20:39:34

标签: apache-spark pyspark

我的数据框如下:

df = spark.createDataFrame(
    [
        (
            1, 
            "2017-12-03", 
            """{"1":[{"john":[12443,12441],"james":[14380,14379,13463],"mike":[15284,15280]}],"2":[{"brian":[15284,15280],"julio":[15284],"org":[]}]}"""
        ),
        (
            2, 
            "2017-12-04", 
            """{"1":[{"john":[12443,12441],"james":[14380,14379,13463],"mike":[15284,15280]}],"2":[{"brian":[15284,15280]}]}"""
        )
    ],
    ("id", "date", "users")
)

并且我有一个加载为json的函数:

@udf("map<string, array<string>>")
def parse(s):
    try:
        return json.loads(s)
    except:
        pass

当我选择顶层时,它看起来不错,但正在为用户删除双引号:

df.select("id", "date", explode(parse("users")).alias("tier_id", "user_list")).show()

+---+----------+-------+--------------------+
| id|      date|tier_id|           user_list|
+---+----------+-------+--------------------+
|  1|2017-12-03|      1|[{john=[12443, 12...|
|  1|2017-12-03|      2|[{julio=[15284], ...|
|  2|2017-12-04|      1|[{john=[12443, 12...|
|  2|2017-12-04|      2|[{brian=[15284, 1...|
+---+----------+-------+--------------------+

当我尝试爆炸用户时,出现以下错误消息:

df.select("id", "date", explode(parse("users")).alias("tier_id", "user_list"))\
    .withColumn("user_list", explode("user_list")).alias("user", "drill").show()

TypeError: alias() takes exactly 2 arguments (3 given)

我认为它不会爆炸user_list,因为所有双引号都被删除了。 我该如何工作?

1 个答案:

答案 0 :(得分:0)

问题是您的udf返回的是字符串数组而不是映射数组。您可以再次使用json库解析该字符串,也可以更改udf以返回正确的类型:

@udf("map<string, array<map<string, array<int>>>>")
def parse(s):
    try:
        return json.loads(s)
    except:
        pass

这将返回预期的类型(字典数组,带有字符串键和整数数组的值)。

df.select("id", "date", explode(parse("users")).alias("tier_id", "user_list"))\
    .withColumn("user_list", explode("user_list"))\
    .select("id", "date", "tier_id", explode("user_list").alias("user", "drill")).show()

# +---+----------+-------+-----+--------------------+
# | id|      date|tier_id| user|               drill|
# +---+----------+-------+-----+--------------------+
# |  1|2017-12-03|      1| john|      [12443, 12441]|
# |  1|2017-12-03|      1| mike|      [15284, 15280]|
# |  1|2017-12-03|      1|james|[14380, 14379, 13...|
# |  1|2017-12-03|      2|julio|             [15284]|
# |  1|2017-12-03|      2|brian|      [15284, 15280]|
# |  1|2017-12-03|      2|  org|                  []|
# |  2|2017-12-04|      1| john|      [12443, 12441]|
# |  2|2017-12-04|      1| mike|      [15284, 15280]|
# |  2|2017-12-04|      1|james|[14380, 14379, 13...|
# |  2|2017-12-04|      2|brian|      [15284, 15280]|
# +---+----------+-------+-----+--------------------+

请注意,您需要爆炸user_list两次,因为它以地图数组而不是地图类型开始。