Question

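We have a CSV file with several million records.

Many fields in these records hold a value that refers to a foreign key on a database table.

During the import (into a database table), we have to replace each of these values with the corresponding ID from its master table.

We have it working, but we don't think we are doing it in the best way, because it is really slow and consumes a lot of memory.

This is our first time using PySpark, so any suggestions are welcome :)

Right now, our PySpark script connects to all of the master tables, joins them with the source data (a CSV on S3), and then inserts the result into the database table.

Approximate code: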

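import sys
from awsglue.transforms import ApplyMapping, Join
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# resolve the job arguments passed in by Glue
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)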

origin = glueContext.create_dynamic_frame.from_catalog(database="glue-test-mysql", table_name="origin_csv")
# get master table data
scenario = glueContext.create_dynamic_frame.from_catalog(database = "glue-test-mysql", table_name = "master_table")

# map fields to output table
origin = ApplyMapping.apply(frame = origin, mappings = [("col0", "string", "Field_3703_aux", "string"), ("col1", "string", "Field_3704_aux", "string"), ("col2", "string", "Field_3705_aux", "string"), ("col3", "string", "Field_3706_aux", "string"), ("col4", "string", "Field_3707_aux", "string"), ("col5", "string", "Field_3708_aux", "string"), ("col6", "string", "Field_3826_aux", "string"), ("col7", "double", "Field_3711", "double"), ("col8", "string", "Field_3712", "string")], transformation_ctx = "origin")

# we keep only our desired fields
scenario = scenario.drop_fields(['Field_3618', 'Field_3620', 'Field_3624', 'Field_3625', 'Field_3626', 'Field_3798', 'Field_3808', 'Field_4397_d_5b27944147d26', 'Field_4398'])

# join master table, and swap Value by ID
origin = Join.apply(origin, scenario, 'Field_3703_aux', 'Field_3619').drop_fields(['Field_3619', 'Field_3703_aux']).rename_field('ID', 'Field_3703')

# insert to our DB Table
datasink4 = glueContext.write_dynamic_frame.from_catalog(frame=origin, name_space='glue-test-mysql', table_name='output_table')

job.commit()

Is there a better way to do these "joins"? The master tables are small, so we could keep them in memory or cached.
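To make the idea concrete, here is a minimal sketch in plain Python (not Spark) of what we mean by keeping a small master table in memory: index it once as a value-to-ID dictionary, then stream the large input and swap each value for its ID. The sample rows and the helper names `build_lookup` and `swap_value_for_id` are made up for illustration; only the field names come from our job.

```python
# Hypothetical sample data: field names mirror the job above, values are invented.
MASTER_ROWS = [
    {"ID": 1, "Field_3619": "alpha"},
    {"ID": 2, "Field_3619": "beta"},
]

ORIGIN_ROWS = [
    {"Field_3703_aux": "beta", "Field_3711": 1.5},
    {"Field_3703_aux": "alpha", "Field_3711": 2.0},
    {"Field_3703_aux": "unknown", "Field_3711": 3.0},
]


def build_lookup(master_rows, key_field, id_field):
    """Index the small master table once as a dict: value -> ID."""
    return {row[key_field]: row[id_field] for row in master_rows}


def swap_value_for_id(rows, lookup, aux_field, out_field):
    """Stream the large input, replacing aux_field with the looked-up ID.

    Values missing from the master table map to None here (left-join
    behaviour). This is an assumption of the sketch: an inner join would
    drop such rows instead.
    """
    for row in rows:
        out = {k: v for k, v in row.items() if k != aux_field}
        out[out_field] = lookup.get(row[aux_field])
        yield out


lookup = build_lookup(MASTER_ROWS, "Field_3619", "ID")
result = list(swap_value_for_id(ORIGIN_ROWS, lookup, "Field_3703_aux", "Field_3703"))

for row in result:
    print(row)
```

In Spark the analogous hint appears to be `pyspark.sql.functions.broadcast`, applied to the small DataFrame in a join after converting the DynamicFrames with `toDF()`. We have not verified whether that helps in this Glue job.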

Replacing values with IDs from the master tables

0 answers: