AWS Glue: Using ResolveChoice to project to timestamp drops the field when converting to Parquet

Time: 2018-05-22 22:21:38

Tags: apache-spark parquet aws-glue

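I'm trying to convert a series of gzip-compressed (.gz) files to Parquet format.

Along the way I'm applying a few transformations (reducing the number of fields, casting, etc.). After some debugging, it appears that when I try to project a field to timestamp, the resulting Parquet file is missing that field.

Relevant Python snippet: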

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_options(connection_type = "s3", connection_options = {"paths": ["<s3 path>"]}, format = "json", transformation_ctx = "read")

datasource0 = ApplyMapping.apply(frame = datasource0, mappings = [("timestamp", "string", "timestamp", "long"), ("name", "string", "name", "string"), ("value", "string", "value", "string"), ("type", "string", "type", "string")])
datasource0 = SelectFields.apply(frame = datasource0, paths = ["timestamp", "name", "value", "type"])

# here is where the parquet schema changes 
# the timestamp column is no longer there in parquet tools

datasource0 = ResolveChoice.apply(frame = datasource0, specs = [('timestamp','project:timestamp'), ('name','cast:string'), ('type','cast:string'), ('value','cast:string')])


glueContext.write_dynamic_frame.from_options(datasource0, connection_type = "s3", connection_options = {"path": "<another s3 path>"}, format = "parquet", format_options = {'compression': 'gzip'}, transformation_ctx = "write")

job.commit()

Up to the ResolveChoice step, if I run parquet-tools on the resulting Parquet file, I see all four fields.

However, once that line is included, I get this:

message spark_schema {
  optional binary name (UTF8);
  optional binary value (UTF8);
  optional binary type (UTF8);
}

The timestamp field is missing.

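According to the ResolveChoice documentation:

project: Resolves a potential ambiguity by projecting all the data to one of the possible data types. For example, if data in a column could be an int or a string, using a project:string action produces a column in the resulting DynamicFrame where all the int values have been converted to strings.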
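
To make the description of the project action above concrete, here is a toy illustration in plain Python. It is not Glue code and does not reflect how ResolveChoice is implemented. The function name `project_to_string` and the sample values are made up for illustration. It only mimics the documented effect of project:string on a column whose values may be int or string.

```python
def project_to_string(values):
    """Mimic the documented effect of project:string on a mixed int/string column.

    Hypothetical helper for illustration only; not part of the AWS Glue API.
    """
    # Every present value, int or str, ends up as a str.
    # A missing value (None) is left as None.
    return [None if v is None else str(v) for v in values]


# Made-up sample column holding a mix of ints and strings.
mixed_column = [1, "2", 3, "four", None]
projected = project_to_string(mixed_column)
print(projected)
```

In the job above, by contrast, the timestamp column has already been mapped to long by ApplyMapping, and the spec asks to project it to timestamp, so the target is not one of the types the column currently holds. That difference may be relevant to the behavior I'm seeing, though I have not confirmed it.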

So I'm wondering: is there no way to project to a timestamp type?

0 answers:

No answers