My DataFrame looks like this:
df = spark.createDataFrame(
    [
        (
            1,
            "2017-12-03",
            """{"1":[{"john":[12443,12441],"james":[14380,14379,13463],"mike":[15284,15280]}],"2":[{"brian":[15284,15280],"julio":[15284],"org":[]}]}"""
        ),
        (
            2,
            "2017-12-04",
            """{"1":[{"john":[12443,12441],"james":[14380,14379,13463],"mike":[15284,15280]}],"2":[{"brian":[15284,15280]}]}"""
        )
    ],
    ("id", "date", "users")
)
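and I have a function that loads the column as JSON:
import json
from pyspark.sql.functions import udf, explode

@udf("map<string, array<string>>")
def parse(s):
    try:
        return json.loads(s)
    except (TypeError, ValueError):
        # Null or malformed input: return None so the row gets a null value
        return None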
When I select the top level it looks fine, but the double quotes around the user names are stripped:
df.select("id", "date", explode(parse("users")).alias("tier_id", "user_list")).show()
+---+----------+-------+--------------------+
| id|      date|tier_id|           user_list|
+---+----------+-------+--------------------+
|  1|2017-12-03|      1|[{john=[12443, 12...|
|  1|2017-12-03|      2|[{julio=[15284], ...|
|  2|2017-12-04|      1|[{john=[12443, 12...|
|  2|2017-12-04|      2|[{brian=[15284, 1...|
+---+----------+-------+--------------------+
When I try to explode the users, I get the following error message:
df.select("id", "date", explode(parse("users")).alias("tier_id", "user_list"))\
    .withColumn("user_list", explode("user_list")).alias("user", "drill").show()
TypeError: alias() takes exactly 2 arguments (3 given)
I think it fails to explode user_list because all the double quotes have been removed. How can I make this work?
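Answer 0 (score: 0)
The problem is that your udf returns an array of strings rather than an array of maps. You can either parse that string again with the json library, or change the udf to return the correct type: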
@udf("map<string, array<map<string, array<int>>>>")
def parse(s):
    try:
        return json.loads(s)
    except (TypeError, ValueError):
        # Null or malformed input: return None so the row gets a null value
        return None
This returns the expected type: each map value is an array of dictionaries with string keys and integer-array values.
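As a rough sanity check of that schema string, the sketch below parses one sample value from the question with plain Python (no Spark required) and derives a Spark-style type description from it. The `describe` and `merge` helpers are hypothetical names invented for this illustration; they are not part of PySpark.

```python
import json

# Sample value copied from the "users" column of the first row in the question.
raw = ('{"1":[{"john":[12443,12441],"james":[14380,14379,13463],'
       '"mike":[15284,15280]}],"2":[{"brian":[15284,15280],'
       '"julio":[15284],"org":[]}]}')


def merge(kinds):
    """Hypothetical helper: collapse element types into one, ignoring unknowns."""
    # Empty lists such as "org": [] give no element type, so they are skipped.
    known = {k for k in kinds if "?" not in k}
    return known.pop() if len(known) == 1 else "?"


def describe(value):
    """Hypothetical helper: Spark-like type string for a parsed JSON value."""
    if isinstance(value, dict):
        return "map<string, " + merge(describe(v) for v in value.values()) + ">"
    if isinstance(value, list):
        return "array<" + merge(describe(v) for v in value) + ">"
    if isinstance(value, int):
        return "int"
    return "string"


schema = describe(json.loads(raw))
print(schema)
```

For this sample the derived string should match the return type declared on the udf, which is a quick way to confirm that the values under each tier are arrays of maps, not arrays of strings.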
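# First explode: top-level map -> (tier_id, user_list) rows
df.select("id", "date", explode(parse("users")).alias("tier_id", "user_list"))\
    .withColumn("user_list", explode("user_list"))\
    .select("id", "date", "tier_id", explode("user_list").alias("user", "drill")).show()
# Note: alias("user", "drill") is applied to the explode column inside select, not to the DataFrame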
# +---+----------+-------+-----+--------------------+
# | id|      date|tier_id| user|               drill|
# +---+----------+-------+-----+--------------------+
# |  1|2017-12-03|      1| john|      [12443, 12441]|
# |  1|2017-12-03|      1| mike|      [15284, 15280]|
# |  1|2017-12-03|      1|james|[14380, 14379, 13...|
# |  1|2017-12-03|      2|julio|             [15284]|
# |  1|2017-12-03|      2|brian|      [15284, 15280]|
# |  1|2017-12-03|      2|  org|                  []|
# |  2|2017-12-04|      1| john|      [12443, 12441]|
# |  2|2017-12-04|      1| mike|      [15284, 15280]|
# |  2|2017-12-04|      1|james|[14380, 14379, 13...|
# |  2|2017-12-04|      2|brian|      [15284, 15280]|
# +---+----------+-------+-----+--------------------+
Note that you need to explode user_list twice, because it starts out as an array of maps rather than as a map.
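To make the nesting concrete, here is a minimal pure-Python sketch of what the chained explodes do to one row: one loop for the top-level map, then two loops for user_list (first the array, then each map inside it). The `flatten` function is a hypothetical helper written for this illustration and uses only the standard json module, so it runs without Spark.

```python
import json


def flatten(row_id, date, users_json):
    """Hypothetical helper: yield (id, date, tier_id, user, drill) tuples."""
    parsed = json.loads(users_json)
    for tier_id, user_list in parsed.items():      # explode the top-level map
        for user_map in user_list:                 # 1st explode of user_list: the array
            for user, drill in user_map.items():   # 2nd explode of user_list: each map
                yield (row_id, date, tier_id, user, drill)


# "users" value copied from the second row in the question.
raw = ('{"1":[{"john":[12443,12441],"james":[14380,14379,13463],'
       '"mike":[15284,15280]}],"2":[{"brian":[15284,15280]}]}')

rows = list(flatten(2, "2017-12-04", raw))
for row in rows:
    print(row)
```

In this sketch tier_id stays a string because JSON object keys are always strings, which is consistent with the string key type in the udf's declared return type.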