I have a data frame that looks like so:
>>> l = [('a', 'foo', 1), ('b', 'bar', 1), ('a', 'biz', 6), ('c', 'bar', 3), ('c', 'biz', 2)]
>>> df = spark.createDataFrame(l, ('uid', 'code', 'level'))
>>> df.show()
+---+----+-----+
|uid|code|level|
+---+----+-----+
|  a| foo|    1|
|  b| bar|    1|
|  a| biz|    6|
|  c| bar|    3|
|  c| biz|    2|
+---+----+-----+
What I'm trying to do is group the code and level values into a list of dicts and dump that list as a JSON string so that I can save the data frame to disk. The result would look like:
>>> df.show()
+---+--------------------------+
|uid|           json           |
+---+--------------------------+
|  a|  '[{"foo":1}, {"biz":6}]'|
|  b|             '[{"bar":1}]'|
|  c|  '[{"bar":3}, {"biz":2}]'|
+---+--------------------------+
I'm still pretty new to PySpark and I'm having a lot of trouble figuring out how to get this result. I almost surely need a groupBy, and I've tried implementing it by creating a new StringType column called "json" and then using the pandas_udf decorator. But I'm getting errors about unhashable types because, as I've found out, the way I'm accessing the data grabs the whole column, not just a single row.
>>> df = df.withColumn('json', F.list(''))
>>> schema = df.schema
>>> @pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
... def to_json(pdf):
...     return pdf.assign(serial=json.dumps({pdf.code: pdf.level}))
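For context, the unhashable-type error comes from using the whole pdf.code Series as a dict key. A row-wise version of the same idea might look like the sketch below (assuming Spark 2.3+, where GROUPED_MAP pandas UDFs are available; the names and schema string here are illustrative, not from the original attempt):

import json
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# One output row per uid: the uid plus the serialized list of {code: level} dicts.
@pandas_udf('uid string, json string', PandasUDFType.GROUPED_MAP)
def to_json_udf(pdf):
    # Build the dicts row by row; cast the numpy ints so json.dumps accepts them.
    serialized = json.dumps([{c: int(v)} for c, v in zip(pdf.code, pdf.level)])
    return pd.DataFrame({'uid': [pdf.uid.iloc[0]], 'json': [serialized]})

# Applied with: df.groupby('uid').apply(to_json_udf)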
I've also considered string concatenation between the two columns combined with collect_set (roughly the sketch below), but that feels wrong as well, since it could write strings to disk that merely look like JSON but can't actually be loaded back with json.loads. Any help is appreciated.
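Here is roughly what that concatenation idea looks like (a sketch; the exact expressions are my own illustration):

from pyspark.sql import functions as F

# Builds JSON-looking strings by hand; a code value containing a quote or
# bracket would silently produce output that json.loads rejects.
df.groupby('uid').agg(
    F.collect_set(
        F.concat(F.lit('{"'), F.col('code'), F.lit('":'), F.col('level').cast('string'), F.lit('}'))
    ).alias('parts')
).show(truncate=False)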
Answer 0 (score: 2):

You don't need pandas_udf in this case; to_json, collect_list, and create_map should be all you need:
import pyspark.sql.functions as f

df.groupby('uid').agg(
    f.to_json(
        f.collect_list(
            f.create_map('code', 'level')
        )
    ).alias('json')
).show(3, False)
+---+---------------------+
|uid|json                 |
+---+---------------------+
|c |[{"bar":3},{"biz":2}]|
|b |[{"bar":1}] |
|a |[{"foo":1},{"biz":6}]|
+---+---------------------+
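Since the stated goal is saving the frame to disk, the aggregated result can be written out directly. A minimal sketch follows (the sink format and path are illustrative, not from the answer); note also that collect_list makes no ordering guarantee, so the dicts may appear in any order within each list:

result = df.groupby('uid').agg(
    f.to_json(f.collect_list(f.create_map('code', 'level'))).alias('json')
)

# Any supported sink works; parquet and this path are just examples.
result.write.mode('overwrite').parquet('/tmp/uid_json')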