不久前我问python这个问题,但是现在我需要在PySpark中做同样的事情。
我有一个数据框(df),如下所示:
|cust_id|address |store_id|email |sales_channel|category|
-------------------------------------------------------------------
|1234567|123 Main St|10SjtT |idk@gmail.com|ecom |direct |
|4567345|345 Main St|10SjtT |101@gmail.com|instore |direct |
|1569457|876 Main St|51FstT |404@gmail.com|ecom |direct |
,我想将最后4个字段组合成一个像json这样的元数据字段:
|cust_id|address |metadata |
-------------------------------------------------------------------------------------------------------------------
|1234567|123 Main St|{'store_id':'10SjtT', 'email':'idk@gmail.com','sales_channel':'ecom', 'category':'direct'} |
|4567345|345 Main St|{'store_id':'10SjtT', 'email':'101@gmail.com','sales_channel':'instore', 'category':'direct'}|
|1569457|876 Main St|{'store_id':'51FstT', 'email':'404@gmail.com','sales_channel':'ecom', 'category':'direct'} |
这是我在python中执行此操作的代码:
cols = [
'store_id',
'store_category',
'sales_channel',
'email'
]
df1 = df.copy()
df1['metadata'] = df1[cols].to_dict(orient='records')
df1 = df1.drop(columns=cols)
但是我想将其转换为PySpark代码以使用spark数据框;我不想在Spark中使用熊猫。
答案 0 :(得分:6)
使用 to_json
函数创建json对象!
Example:
from pyspark.sql.functions import *
#sample data
df=spark.createDataFrame([('1234567','123 Main St','10SjtT','idk@gmail.com','ecom','direct')],['cust_id','address','store_id','email','sales_channel','category'])
df.select("cust_id","address",to_json(struct("store_id","category","sales_channel","email")).alias("metadata")).show(10,False)
#result
+-------+-----------+----------------------------------------------------------------------------------------+
|cust_id|address |metadata |
+-------+-----------+----------------------------------------------------------------------------------------+
|1234567|123 Main St|{"store_id":"10SjtT","category":"direct","sales_channel":"ecom","email":"idk@gmail.com"}|
+-------+-----------+----------------------------------------------------------------------------------------+
to_json by passing list of columns:
ll=['store_id','email','sales_channel','category']
df.withColumn("metadata", to_json(struct([x for x in ll]))).drop(*ll).show()
#result
+-------+-----------+----------------------------------------------------------------------------------------+
|cust_id|address |metadata |
+-------+-----------+----------------------------------------------------------------------------------------+
|1234567|123 Main St|{"store_id":"10SjtT","email":"idk@gmail.com","sales_channel":"ecom","category":"direct"}|
+-------+-----------+----------------------------------------------------------------------------------------+
答案 1 :(得分:0)
@Shu 给出了一个很好的答案,这是一个更适合我的用例的变体。我将从 Kafka -> Spark -> Kafka 出发,这个班轮正是我想要的。 struct(*)
将打包数据帧中的所有字段。
# Packup the fields in preparation for sending to Kafka sink
kafka_df = df.selectExpr('cast(id as string) as key', 'to_json(struct(*)) as value')