How do I nest this flat JSON in PySpark? I want to go from:
{
    'event_type': 'click',
    'id': '223',
    'person_id': 201031940,
    'category': 'Chronicles',
    'approved_content': 1
}
to
{
    'event_type': 'click',
    'user': {
        'id': '223',
        'person_id': 201031940
    },
    'event': {
        'category': 'Chronicles',
        'approved_content': 1
    }
}
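The regrouping being asked for can be sketched in plain Python, independent of Spark (the helper name `nest_event` is mine, for illustration only):

```python
def nest_event(flat):
    """Regroup a flat event dict into nested 'user' and 'event' objects."""
    return {
        'event_type': flat['event_type'],
        'user': {'id': flat['id'], 'person_id': flat['person_id']},
        'event': {'category': flat['category'],
                  'approved_content': flat['approved_content']},
    }

flat = {'event_type': 'click', 'id': '223', 'person_id': 201031940,
        'category': 'Chronicles', 'approved_content': 1}
nested = nest_event(flat)
print(nested['user'])  # {'id': '223', 'person_id': 201031940}
```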
Answer 0 (score: 1)
Here is what you can do. The complete code:
from pyspark.sql.types import (
    StringType,
    StructField,
    StructType,
    MapType
)
from pyspark.sql.functions import udf
events_schema = StructType([
    StructField('event_type', StringType(), True),
    StructField('id', StringType(), True),
    StructField('person_id', StringType(), True),
    StructField('category', StringType(), True),
    StructField('approved_content', StringType(), True),
])
events = [{
    'event_type': 'click',
    'id': '223',
    'person_id': 201031940,
    'category': 'Chronicles',
    'approved_content': 1
}]
df = spark.createDataFrame(events, schema=events_schema)
build_user_udf = udf(lambda id, person_id: {
    'id': id,
    'person_id': person_id
}, MapType(StringType(), StringType()))
build_event_udf = udf(lambda category, approved_content: {
    'category': category,
    'approved_content': approved_content
}, MapType(StringType(), StringType()))
nested_event_df = (
    df
    .withColumn('user', build_user_udf(df['id'], df['person_id']))
    .withColumn('event', build_event_udf(df['category'], df['approved_content']))
    .drop('id')
    .drop('person_id')
    .drop('category')
    .drop('approved_content')
)
nested_event_df.toJSON().first()
'{"event_type":"click","user":{"id":"223","person_id":"201031940"},"event":{"approved_content":"1","category":"Chronicles"}}'
nested_event_df.take(1)
[Row(event_type='click', user={'id': '223', 'person_id': '201031940'}, event={'approved_content': '1', 'category': 'Chronicles'})]
This is a very basic version, but you can optimize it further as needed.
Answer 1 (score: 1)
You can also do this without UDFs, which is more efficient and makes a big difference when processing large numbers of records:
from pyspark.sql.types import (
    StringType,
    StructField,
    StructType
)
import pyspark.sql.functions as f
events_schema = StructType([
    StructField('event_type', StringType(), True),
    StructField('id', StringType(), True),
    StructField('person_id', StringType(), True),
    StructField('category', StringType(), True),
    StructField('approved_content', StringType(), True),
])
events = [{
    'event_type': 'click',
    'id': '223',
    'person_id': 201031940,
    'category': 'Chronicles',
    'approved_content': 1
}]
df = spark.createDataFrame(events, schema=events_schema)
newDf = (df
    .withColumn('user', f.struct(df.id, df.person_id))
    .withColumn('event', f.struct(df.category, df.approved_content))
    .withColumn('nestedEvent', f.struct(f.col('user'), f.col('event')))
    .select('nestedEvent'))
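For reference, here is the nested shape I would expect one row of that select to serialize to, sketched as a plain Python dict. This is an assumption based on the all-StringType schema above (numeric inputs come back as strings) and the column order passed to f.struct; note also that selecting only nestedEvent leaves event_type out, so to match the question's target layout exactly you would select event_type alongside the two structs.

```python
import json

# Expected shape of one serialized row, assuming the all-StringType schema
# above coerces person_id and approved_content to strings.
expected = {
    'nestedEvent': {
        'user': {'id': '223', 'person_id': '201031940'},
        'event': {'category': 'Chronicles', 'approved_content': '1'},
    }
}
print(json.dumps(expected, sort_keys=True))
```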