I'm a beginner and I have data in the following format:
Category,Subcategory,Name
Food,Thai,Restaurant A
Food,Thai,Restaurant B
Food, Chinese, Restaurant C
Lodging, Hotel, Hotel A
I would like the data in the following format:
{Category : Food , Subcategories : [ {subcategory : Thai , names : [Restaurant A , Restaurant B] }, {subcategory : Chinese , names : [Restaurant C]}]}
{Category : Lodging , Subcategories : [ {subcategory : Hotel , names : [Hotel A] }]}
Can someone help me solve this with PySpark RDDs?
Thanks!
Answer 0 (score: 0)
Here is a working solution:
Create a window over the "Category" and "Subcategory" columns and collect the names within each group.
from pyspark.sql import functions as F
from pyspark.sql import Window

# Window over each (Category, Subcategory) pair; collect_list over it
# attaches the group's full list of names to every row in the group
groupByCateWind = Window.partitionBy("Category", "Subcategory")

finalDf = (df.withColumn("names", F.collect_list("Name").over(groupByCateWind))
             .withColumn("Subcategories", F.struct("Subcategory", "names"))
             .groupBy("Category")
             .agg(F.collect_set("Subcategories").alias("Subcategories"))
             .toJSON())  # RDD of JSON strings
1. Collect the name groups with the window function.
2. Create a Subcategories column of struct type from the Subcategory and names columns.
3. Group by Category again and collect the Subcategories column values.
The output looks like this (the leading spaces in " Chinese" and " Hotel" are carried over from the spaces after the commas in the CSV):
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"Category":"Food","Subcategories":[{"Subcategory":"Thai","names":["Restaurant A","Restaurant B"]},{"Subcategory":" Chinese","names":[" Restaurant C"]}]}|
|{"Category":"Lodging","Subcategories":[{"Subcategory":" Hotel","names":[" Hotel A"]}]} |
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
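Since the question asks about RDDs specifically, the same output can also be produced with plain RDD operations. This is a sketch under the assumption that the CSV lives at data.csv, with each field trimmed so the stray spaces disappear:

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.textFile("data.csv")  # hypothetical path

# Drop the header row and split each line, trimming stray spaces
header = rdd.first()
rows = (rdd.filter(lambda line: line != header)
           .map(lambda line: [f.strip() for f in line.split(",")]))

result = (rows.map(lambda r: ((r[0], r[1]), [r[2]]))
              .reduceByKey(lambda a, b: a + b)    # names per (Category, Subcategory)
              .map(lambda kv: (kv[0][0], [{"subcategory": kv[0][1], "names": kv[1]}]))
              .reduceByKey(lambda a, b: a + b)    # subcategory dicts per Category
              .map(lambda kv: json.dumps({"Category": kv[0], "Subcategories": kv[1]})))

for line in result.collect():
    print(line)

The DataFrame version above is usually preferable, since Spark can optimize the grouping, but the RDD route maps one-to-one onto the requested JSON shape.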