I have an rdd similar to the following:
rdd1 = sc.parallelize([('C3', ['P8', 'P3', 'P2']), ('C1', ['P1', 'P5', 'P5', 'P2']), ('C4', ['P3', 'P4']), ('C2', ['P3']), ('C5', ['P3', 'P9'])])
I have a dataframe similar to the following:
new_df = spark.createDataFrame([
("P1", "Shirt", "Green", 25, 2000),
("P2", "Jeans", "yello", 30, 1500),
("P3", "Sweater", "Red", 35, 1000),
("P4", "Kurta", "Black", 28, 950),
("P5", "Saree", "Green", 25, 1500),
("P8", "Shirt", "Black", 32, 2500),
("P9", "Sweater", "Red", 30, 1000)
], ["Product", "Item", "Color", "Size", "Price"])
I need to create an rdd from rdd1 in which each list of product IDs is replaced with the product details from the dataframe; for example, the details for P8 should be pulled in from new_df. I expect the output rdd to look like the following:
[('C3', [{'Price': '2500', 'Color ': 'Black', 'Size': '32', 'Item': 'Shirt'}, {'Price': '1000', 'Color ': 'Red', 'Size': '35', 'Item': 'Sweater'}, {'Price': '1500', 'Color ': 'Yellow', 'Size': '30', 'Item': 'Jeans'}]), ('C1', [{'Price': '2000', 'Color ': 'Green', 'Size': '25', 'Item': 'Shirt'}, {'Price': '1500', 'Color ': 'Green', 'Size': '25', 'Item': 'Saree'}, {'Price': '1500', 'Color ': 'Green', 'Size': '25', 'Item': 'Saree'}, {'Price': '1500', 'Color ': 'Yellow', 'Size': '30', 'Item': 'Jeans'}]), ('C4', [{'Price': '1000', 'Color ': 'Red', 'Size': '35', 'Item': 'Sweater'}, {'Price': '950', 'Color ': 'Black', 'Size': '28', 'Item': 'Kurta'}]), ('C2', [{'Price': '1000', 'Color ': 'Red', 'Size': '35', 'Item': 'Sweater'}]), ('C5', [{'Price': '1000', 'Color ': 'Red', 'Size': '35', 'Item': 'Sweater'}, {'Price': '1000', 'Color ': 'Red', 'Size': '30', 'Item': 'Sweater'}])]
Answer 0 (score: 1)
You should also convert rdd1 into a dataframe. Then you need to explode the Product array in that dataframe so you can join the two dataframes on the common Product column. After the join you can convert the joined columns from new_df to json, selecting only the columns you need. The last step is to group by the id, as in the original rdd1, and collect the json strings.
from pyspark.sql import functions as F
dataframe = sqlContext.createDataFrame(rdd1, ['id', 'Product'])\
.withColumn('Product', F.explode(F.col('Product')))\
.join(new_df, ['Product'], 'left')\
.select('id', F.to_json(F.struct(F.col('Price'), F.col('Color'), F.col('Size'), F.col('Item'))).alias('json'))\
.groupBy('id')\
.agg(F.collect_list('json'))
which should give you the following dataframe:
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |collect_list(json) |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|C3 |[{"Price":1500,"Color":"yello","Size":30,"Item":"Jeans"}, {"Price":2500,"Color":"Black","Size":32,"Item":"Shirt"}, {"Price":1000,"Color":"Red","Size":35,"Item":"Sweater"}] |
|C4 |[{"Price":1000,"Color":"Red","Size":35,"Item":"Sweater"}, {"Price":950,"Color":"Black","Size":28,"Item":"Kurta"}] |
|C5 |[{"Price":1000,"Color":"Red","Size":35,"Item":"Sweater"}, {"Price":1000,"Color":"Red","Size":30,"Item":"Sweater"}] |
|C1 |[{"Price":1500,"Color":"yello","Size":30,"Item":"Jeans"}, {"Price":2000,"Color":"Green","Size":25,"Item":"Shirt"}, {"Price":1500,"Color":"Green","Size":25,"Item":"Saree"}, {"Price":1500,"Color":"Green","Size":25,"Item":"Saree"}]|
|C2 |[{"Price":1000,"Color":"Red","Size":35,"Item":"Sweater"}] |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
To convert the above dataframe to an rdd, just call the .rdd api.
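Note that collect_list('json') gives a list of JSON strings, not Python dicts, so to match the dict-shaped output the question asks for, each string still needs to be parsed on the RDD side. A plain-Python illustration of that final parsing step (outside Spark; the sample row mirrors what .rdd would yield for customer C2):

```python
import json

# One row of the grouped dataframe, as (id, list-of-json-strings),
# mirroring what .rdd would yield for customer C2.
row = ('C2', ['{"Price":1000,"Color":"Red","Size":35,"Item":"Sweater"}'])

# Parse each JSON string back into a Python dict; on the Spark side this
# would be something like:
#   dataframe.rdd.map(lambda r: (r[0], [json.loads(s) for s in r[1]]))
parsed = (row[0], [json.loads(s) for s in row[1]])
print(parsed)
# → ('C2', [{'Price': 1000, 'Color': 'Red', 'Size': 35, 'Item': 'Sweater'}])
```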
Update
From the comments below: "The expected dataframe should look like |C3|[Map(Item -> Shirt, Price -> 2500, Size -> 32, Color -> Black), Map(Item -> Sweater, Price -> 1000, Size -> 35, Color -> Red), Map(Item -> Jeans, Price -> 1500, Size -> 30, Color -> Yellow)]|; only then can I convert it to an rdd appropriately."
It seems you are looking for a MapType in the collected list rather than a StringType. For that, you will have to write a udf function:
from pyspark.sql import functions as F
from pyspark.sql import types as T
columns = ['Price', 'Color', 'Size', 'Item']  # the columns packed into each map

def mapFunction(y):
    # zip the column names with the row's values into a dict
    newMap = {}
    for key, value in zip(columns, y):
        newMap.update({key: value})
    return newMap
udfFunction = F.udf(mapFunction, T.MapType(T.StringType(), T.StringType()))
and call it in your code in place of the to_json and struct functions:
dataframe = sqlContext.createDataFrame(rdd1, ['id', 'Product']) \
.withColumn('Product', F.explode(F.col('Product'))) \
.join(new_df, ['Product'], 'left') \
.select('id', udfFunction(F.array([F.col(x) for x in columns])).alias('json')) \
.groupBy('id') \
.agg(F.collect_list('json'))
dataframe.show(truncate=False)
which should give you the output:
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |collect_list(json) |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|C3 |[Map(Item -> Jeans, Price -> 1500, Size -> 30, Color -> yello), Map(Item -> Shirt, Price -> 2500, Size -> 32, Color -> Black), Map(Item -> Sweater, Price -> 1000, Size -> 35, Color -> Red)] |
|C4 |[Map(Item -> Sweater, Price -> 1000, Size -> 35, Color -> Red), Map(Item -> Kurta, Price -> 950, Size -> 28, Color -> Black)] |
|C5 |[Map(Item -> Sweater, Price -> 1000, Size -> 35, Color -> Red), Map(Item -> Sweater, Price -> 1000, Size -> 30, Color -> Red)] |
|C1 |[Map(Item -> Jeans, Price -> 1500, Size -> 30, Color -> yello), Map(Item -> Shirt, Price -> 2000, Size -> 25, Color -> Green), Map(Item -> Saree, Price -> 1500, Size -> 25, Color -> Green), Map(Item -> Saree, Price -> 1500, Size -> 25, Color -> Green)]|
|C2 |[Map(Item -> Sweater, Price -> 1000, Size -> 35, Color -> Red)] |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
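For intuition, the whole pipeline (explode, join, map-building udf, group, collect) boils down to a dictionary lookup plus a zip per product. A plain-Python sketch of the end-to-end transformation, using a subset of the question's sample data (no Spark involved; names here are illustrative only):

```python
# Product details keyed by product id, as they appear in new_df.
details = {
    'P1': {'Item': 'Shirt', 'Color': 'Green', 'Size': 25, 'Price': 2000},
    'P2': {'Item': 'Jeans', 'Color': 'yello', 'Size': 30, 'Price': 1500},
    'P3': {'Item': 'Sweater', 'Color': 'Red', 'Size': 35, 'Price': 1000},
}

# A subset of rdd1's (customer, [product ids]) pairs.
customers = [('C2', ['P3'])]

# Replace each product id with its detail map, skipping unknown ids
# (the Spark version's left join would instead leave nulls).
result = [(cid, [details[p] for p in products if p in details])
          for cid, products in customers]
print(result)
# → [('C2', [{'Item': 'Sweater', 'Color': 'Red', 'Size': 35, 'Price': 1000}])]
```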