AWS Glue: limit input size

Date: 2018-03-06 14:19:27

Tags: python apache-spark dataframe aws-glue

I want to test my AWS Glue PySpark job on a small subset of the available data. How can I achieve this?

My first attempt was to convert the Glue DynamicFrame to a Spark DataFrame and use the `take(n)` method to limit the number of rows to be processed, like so:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "my_db",
    table_name = "my_table",
    transformation_ctx = "ds0")

applymapping1 = ApplyMapping.apply(
    frame = datasource0, 
    mappings = [("foo", "string", "bar", "string")],
    transformation_ctx = "am1")

truncated_df = applymapping1.toDF().take(1000)

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = DynamicFrame.fromDF(truncated_df, glueContext, "tdf"),
    connection_type = "s3", 
    ... )

job.commit()

This fails with the following error message:

AttributeError: 'list' object has no attribute '_jdf'

Any ideas?

1 answer:

Answer 0 (score: 1)

Try doing the conversion in a separate step, and then pass the resulting DynamicFrame by name to the datasink. The root cause of the error is that `take(n)` is an action: it returns a plain Python `list` of `Row` objects rather than a DataFrame, so `DynamicFrame.fromDF()` fails when it tries to access the list's `_jdf` attribute. Use `limit(n)` instead, which is a transformation that returns a new DataFrame containing at most `n` rows.