PySpark with AWS Glue: joining a 1-N relationship into a JSON array

Time: 2019-10-17 09:01:31

Tags: pyspark aws-glue

I can't figure out how to join a 1-N relationship on AWS Glue and export a JSON file like this:

{"id": 123, "name": "John Doe", "profiles": [ {"id": 1111, "channel": "twitter"}, {"id": 2222, "channel": "twitter"}, {"id": 3333, "channel": "instagram"} ]}
{"id": 345, "name": "Test", "profiles": []}

The profiles JSON array should be built from the other tables. I also want to add a channel column.

The 3 tables I have in the AWS Glue Data Catalog are:

person_json

{"id": 123, "name": "John Doe"}
{"id": 345, "name": "Test"}

instagram_json

{"id": 3333, "person_id": 123}
{"id": 3333, "person_id": null}

twitter_json

{"id": 1111, "person_id": 123}
{"id": 2222, "person_id": 123}
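As a reference for what the job is expected to produce, the target shape can be pinned down in plain Python before involving Spark. This is only a sketch over in-memory sample rows modelled on the three tables above (with the person field spelled name, as in the expected output); the helper nest_profiles is a hypothetical name, and sorting the profiles by id is an assumption made to get a stable order.

```python
import json
from collections import defaultdict

# Sample rows modelled on the three catalog tables.
persons = [{"id": 123, "name": "John Doe"}, {"id": 345, "name": "Test"}]
instagram = [{"id": 3333, "person_id": 123}, {"id": 3333, "person_id": None}]
twitter = [{"id": 1111, "person_id": 123}, {"id": 2222, "person_id": 123}]


def nest_profiles(persons, channels):
    """Attach a 'profiles' list to each person; channels maps channel name -> rows."""
    by_person = defaultdict(list)
    for channel, rows in channels.items():
        for row in rows:
            # Profiles without a person_id cannot be attached to anyone.
            if row["person_id"] is not None:
                by_person[row["person_id"]].append({"id": row["id"], "channel": channel})
    return [
        {**p, "profiles": sorted(by_person.get(p["id"], []), key=lambda x: x["id"])}
        for p in persons
    ]


result = nest_profiles(persons, {"instagram": instagram, "twitter": twitter})
for record in result:
    print(json.dumps(record))  # one JSON object per line
```

Note that a person with no profiles gets an empty list rather than a missing or null field, which is the behaviour the desired output asks for.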

This is the script I have so far:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import lit
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

# catalog: database and table names
db_name = "test_database"
tbl_person = "person_json"
tbl_instagram = "instagram_json"
tbl_twitter = "twitter_json"

# Create dynamic frames from the source tables
person = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_person)
instagram = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_instagram)
twitter = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_twitter)

# Join the frames
joined_instagram = Join.apply(person, instagram, 'id', 'person_id').drop_fields(['person_id'])
joined_all = Join.apply(joined_instagram, twitter, 'id', 'person_id').drop_fields(['person_id'])

# Writing output to S3
output_s3_path = "s3://xxx/xxx/person.json"
output = joined_all.toDF().repartition(1)
output.write.mode("overwrite").json(output_s3_path)

How should I change the script to get the desired output?

Thanks

1 answer:

Answer 0 (score: 0)

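from pyspark.sql.functions import collect_set, lit, struct
...
# Tag each source with its channel and pack (id, channel) into one struct column
instagram = instagram.toDF().withColumn( 'channel', lit('instagram') )
instagram = instagram.withColumn( 'profile', struct('id', 'channel') )
twitter = twitter.toDF().withColumn( 'channel', lit('twitter') )
twitter = twitter.withColumn( 'profile', struct('id', 'channel') )

# Stack both sources (union matches columns by position), then collect each person's profiles into an array
profiles = instagram.union(twitter)
profiles = profiles.groupBy('person_id').agg( collect_set('profile').alias('profiles') )

# person is assumed to be a Spark DataFrame here (a DynamicFrame would first need .toDF())
# The left outer join keeps persons that have no profiles; their profiles column is null
joined_all = person.join(profiles, person.id == profiles.person_id, 'left_outer').drop('channel', 'person_id')
joined_all.show(n=2, truncate=False)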

+---+--------+-----------------------------------------------------+
|id |name    |profiles                                             |
+---+--------+-----------------------------------------------------+
|123|John Doe|[[1111, twitter], [2222, twitter], [3333, instagram]]|
|345|Test    |null                                                 |
+---+--------+-----------------------------------------------------+

.show() does not display the full structure of the struct in the profiles field.

print(joined_all.collect())
[Row(id=123, name='John Doe', profiles=[Row(id=1111, channel='twitter'), Row(id=2222, channel='twitter'), Row(id=3333, channel='instagram')]), Row(id=345, name='Test', profiles=None)]