I have a data frame that looks like so:
>>> l = [('a', 'foo', 1), ('b', 'bar', 1), ('a', 'biz', 6), ('c', 'bar', 3), ('c', 'biz', 2)]
>>> df = spark.createDataFrame(l, ('uid', 'code', 'level'))
>>> df.show()
+---+----+-----+
|uid|code|level|
+---+----+-----+
|  a| foo|    1|
|  b| bar|    1|
|  a| biz|    6|
|  c| bar|    3|
|  c| biz|    2|
+---+----+-----+
What I'm trying to do is group the code and level values into a list of dicts and dump that list as a JSON string so that I can save the data frame to disk. The result would look like:
>>> df.show()
+---+--------------------------+
|uid|           json           |
+---+--------------------------+
|  a|  '[{"foo":1}, {"biz":6}]'|
|  b|             '[{"bar":1}]'|
|  c|  '[{"bar":3}, {"biz":2}]'|
+---+--------------------------+
I'm still pretty new to PySpark and I'm having a lot of trouble figuring out how to get this result. I almost surely need a groupBy, and I've tried implementing it by creating a new StringType column called "json" and then using the pandas_udf decorator. But I'm getting errors about unhashable types because, as I've found out, the way I'm accessing the data grabs the whole column, not just a single row.
>>> df = df.withColumn('json', F.list(''))
>>> schema = df.schema
>>> @pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
... def to_json(pdf):
...     return pdf.assign(serial=json.dumps({pdf.code: pdf.level}))
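For context, the unhashable-type error comes from using the whole pdf.code Series as a dict key. A row-wise version of the same idea might look like the sketch below (assuming Spark 2.3+, where GROUPED_MAP pandas UDFs are available; the names and schema string here are illustrative, not from the original attempt):

import json
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# One output row per uid: the uid plus the serialized list of {code: level} dicts.
@pandas_udf('uid string, json string', PandasUDFType.GROUPED_MAP)
def to_json_udf(pdf):
    # Build the dicts row by row; cast the numpy ints so json.dumps accepts them.
    serialized = json.dumps([{c: int(v)} for c, v in zip(pdf.code, pdf.level)])
    return pd.DataFrame({'uid': [pdf.uid.iloc[0]], 'json': [serialized]})

# Applied with: df.groupby('uid').apply(to_json_udf)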
I've also considered string concatenation between the two columns combined with collect_set (roughly the sketch below), but that feels wrong as well, since it could write strings to disk that merely look like JSON but can't actually be loaded back with json.loads. Any help is appreciated.
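Here is roughly what that concatenation idea looks like (a sketch; the exact expressions are my own illustration):

from pyspark.sql import functions as F

# Builds JSON-looking strings by hand; a code value containing a quote or
# bracket would silently produce output that json.loads rejects.
df.groupby('uid').agg(
    F.collect_set(
        F.concat(F.lit('{"'), F.col('code'), F.lit('":'), F.col('level').cast('string'), F.lit('}'))
    ).alias('parts')
).show(truncate=False)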
Answer 0 (score: 2):

You don't need pandas_udf in this case; to_json, collect_list, and create_map should be all you need:
import pyspark.sql.functions as f

df.groupby('uid').agg(
    f.to_json(
        f.collect_list(
            f.create_map('code', 'level')
        )
    ).alias('json')
).show(3, False)
+---+---------------------+
|uid|json                 |
+---+---------------------+
|c |[{"bar":3},{"biz":2}]|
|b |[{"bar":1}] |
|a |[{"foo":1},{"biz":6}]|
+---+---------------------+
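Since the stated goal is saving the frame to disk, the aggregated result can be written out directly. A minimal sketch follows (the sink format and path are illustrative, not from the answer); note also that collect_list makes no ordering guarantee, so the dicts may appear in any order within each list:

result = df.groupby('uid').agg(
    f.to_json(f.collect_list(f.create_map('code', 'level'))).alias('json')
)

# Any supported sink works; parquet and this path are just examples.
result.write.mode('overwrite').parquet('/tmp/uid_json')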