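Take a look at this example. It reads data from an S3 directory and then writes it back to an S3 folder. But what happens if I add data and rerun the job? Am I right that Glue reads and writes all the data again? Or does it detect (how?) only the new data and write just that?
By the way, if I read from partitioned data, do I have to specify the "newly arrived" partitions myself?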
Answer 0 (score: 1)
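From what I can see in that example, they read from a crawled location in S3 and then replace a file on every run, so all the data is fully reloaded each time.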
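To make the difference concrete, here is a toy model in plain Python. It is only a sketch of the two behaviours (reprocess everything versus process only unseen files), not how Glue is implemented internally; the function names and object keys are hypothetical.

```python
def full_reload(source_files):
    """Reprocess every file on each run, as the linked example does."""
    return sorted(source_files)


def incremental(source_files, already_processed):
    """Process only files not seen in an earlier run, then remember them."""
    new_files = sorted(set(source_files) - already_processed)
    already_processed.update(new_files)
    return new_files


# Hypothetical object keys, for illustration only
run_1 = ["data/a.csv", "data/b.csv"]
run_2 = run_1 + ["data/c.csv"]  # one new file arrives before the second run

seen = set()  # state that survives between runs
first = incremental(run_1, seen)
second = incremental(run_2, seen)

print(full_reload(run_2))  # every file again
print(first)               # both files on the first run
print(second)              # only the newly arrived file
```

The key point is the persisted `seen` state: without something remembering what earlier runs handled, every run has to start from scratch.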
To process only new files, you need to enable Bookmarks for your job and make sure you commit the job, like this:
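import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
glue_context = GlueContext(SparkContext.getOrCreate())
# Instantiate your job object to later commit
job = Job(glue_context)
job.init(args['JOB_NAME'], args)
# Read the table. If you enable Bookmarks and commit at the end, this will
# only give you new files. db_name and tbl_name are placeholders for your
# catalog database and table; transformation_ctx is the key under which the
# bookmark state for this read is tracked.
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database=db_name, table_name=tbl_name, transformation_ctx="read_source")
result_dynamic_frame = dynamic_frame  # apply your transformations here
# Append operation to create new parquet files from new data
(result_dynamic_frame.toDF().write
    .mode("append")
    .parquet("s3://bucket/prefix/permit-inspections.parquet"))
# Commit my job so next time we read, only new files will come in
job.commit()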
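Bookmarks themselves are switched on through the `--job-bookmark-option` job argument (or the equivalent setting in the console). As a rough sketch, assuming boto3 is installed and AWS credentials are configured, a run could be started with bookmarks enabled as below; the helper names are hypothetical and the job name is whatever you called your job.

```python
def bookmark_run_arguments(enable=True):
    """Build the Arguments mapping that turns job bookmarks on or off."""
    option = "job-bookmark-enable" if enable else "job-bookmark-disable"
    return {"--job-bookmark-option": option}


def start_run_with_bookmarks(job_name):
    """Start a run of an existing Glue job with bookmarks enabled."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,
        Arguments=bookmark_run_arguments(),
    )
    return response["JobRunId"]


print(bookmark_run_arguments())
```

Calling `start_run_with_bookmarks` with your job's name would kick off a run; it is left uncalled here because it needs a real job and credentials.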
Hope this helps.