Question

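I have some JSON files stored in S3, and I need to convert them to CSV format in the same folders where they are located.

Currently I'm using Glue to map them to Athena but, as I said, I now need to convert them to CSV.

Is it possible to do this with a Glue job?

I'm trying to understand whether a Glue job can crawl through my S3 folder hierarchy and convert every JSON file it finds to CSV (as new files).

If that isn't possible, is there any other AWS service that could help me do this?

EDIT1:

This is the code I'm currently trying to run: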

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://agco-sa-dfs-dv/dealer-data"]}, format = "json")
outputGDF = glueContext.write_dynamic_frame.from_options(frame = inputGDF, connection_type = "s3", connection_options = {"path": "s3://agco-sa-dfs-dv/dealer-data"}, format = "csv")

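The job runs without errors, but nothing seems to happen in the S3 folder. I assumed the code would take the JSON files from /dealer-data and convert them to CSV in the same folder. I may be wrong.

EDIT2:

OK, I almost have it working the way I need.

The problem is that creating the dynamic frame only works for a folder that directly contains files, not for a folder whose files sit in subfolders.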

import sys
import logging
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)

inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://agco-sa-dfs-dv/dealer-data/installations/3555/2019/2"]}, format = "json")

outputGDF = glueContext.write_dynamic_frame.from_options(frame = inputGDF, connection_type = "s3", connection_options = {"path": "s3://agco-sa-dfs-dv/dealer-data/installations/3555/2019/2/bla.csv"}, format = "csv")

The above works, but only for that one directory (../2). Is there a way to read all the files in a given folder and its subfolders?

Answer 1

You should set the recurse option of the S3 connection to True:

inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3", 
    connection_options = {
        "paths": ["s3://agco-sa-dfs-dv/dealer-data/installations/3555/2019/2"],
        "recurse" : True
    }, 
    format = "json"
)
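
Putting the read and the write together, a minimal sketch of the whole job might look like the following. The helper names (`build_read_options`, `convert_json_to_csv`, `main`) and the `dealer-data-csv` output prefix are assumptions made up for illustration; only the two Glue calls and the `recurse` option come from the code above. Writing to a separate prefix is a design choice here, so that a later run does not read its own CSV output back in as input.

```python
def build_read_options(source_root, recurse=True):
    """Connection options for reading every file under an S3 prefix."""
    # "recurse": True asks Glue to descend into subfolders of source_root.
    return {"paths": [source_root], "recurse": recurse}


def convert_json_to_csv(glue_context, source_root, target_root):
    """Read all JSON under source_root and write it as CSV under target_root."""
    frame = glue_context.create_dynamic_frame_from_options(
        connection_type="s3",
        connection_options=build_read_options(source_root),
        format="json",
    )
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": target_root},
        format="csv",
    )
    return frame


def main():
    # Imported here because these modules are only available in the Glue runtime.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext())
    convert_json_to_csv(
        glue_context,
        "s3://agco-sa-dfs-dv/dealer-data",
        "s3://agco-sa-dfs-dv/dealer-data-csv",  # hypothetical output prefix
    )
```

In a Glue job script you would call `main()` at the end. As far as I know, the `path` given to the writer is treated as an output prefix and Glue picks the part-file names itself, so this produces one combined set of CSV files under the target prefix rather than one CSV next to each JSON file.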
