Issue dropping rows with null values in AWS Glue

Time: 2019-02-15 16:56:15

Tags: amazon-web-services apache-spark pyspark amazon-redshift aws-glue

We currently have a problem with an AWS Glue job that reads a collection from S3 and writes it to AWS Redshift, where one of the columns contains null values.

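The job should be quite simple, and most of the code is auto-generated by the Glue interface. However, we have NOT NULL columns in Redshift, and those columns are sometimes null in our dataset, so we cannot get the job to complete.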

Below is a trimmed-down version of the code. It is written in Python and runs in a PySpark environment.

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_1", table_name = "table_1", transformation_ctx = "datasource0")

resolvedDDF = datasource0.resolveChoice(specs = [
  ('price_current','cast:double'),
  ('price_discount','cast:double'),
])

applymapping = ApplyMapping.apply(frame = resolvedDDF, mappings = [
  ("id", "string", "id", "string"), 
  ("status", "string", "status", "string"), 
  ("price_current", "double", "price_current", "double"), 
  ("price_discount", "double", "price_discount", "double"), 
  ("created_at", "string", "created_at", "string"), 
  ("updated_at", "string", "updated_at", "string"), 
], transformation_ctx = "applymapping")

droppedDF = applymapping.toDF().dropna(subset=('created_at', 'price_current'))

newDynamicDF = DynamicFrame.fromDF(droppedDF, glueContext, "newframe")

dropnullfields = DropNullFields.apply(frame = newDynamicDF, transformation_ctx = "dropnullfields")

datasink = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields, catalog_connection = "RedshiftDataStaging", connection_options = {"dbtable": "dbtable_1", "database": "database_1"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink")

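We have NOT NULL constraints on the price_current and created_at columns in Redshift. Because of some early bugs in our system, some records reached the S3 bucket without the required data. We just want to drop these rows, since they make up only a small fraction of the data to be processed.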

Despite the dropna call, we still get the following error from Redshift.

Error (code 1213) while loading data into Redshift: "Missing data for not-null field"
Table name: "PUBLIC".table_1
Column name: created_at
Column type: timestampt(0)
Raw field value: @NULL@

1 Answer:

Answer 0: (score: 0)

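If you do not want to drop the rows, you can fill in default values instead:

df = dropnullfields.toDF()

df = df.na.fill({'price_current': 0.0, 'created_at': ''})

dyf = DynamicFrame.fromDF(df, glueContext, 'glue_context_1')

datasink = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dyf, ..., transformation_ctx = "datasink")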
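To make the fill behaviour concrete, here is a minimal plain-Python sketch that models what a per-column fill does: only null values in the listed columns are replaced. It is not Glue or Spark code. The function name `fill_defaults`, the sample rows, and the placeholder timestamp are assumptions made for illustration. A parseable timestamp string is used as the placeholder on the assumption that an empty string may not load cleanly into a timestamp column.

```python
# Plain-Python model of a per-column null fill. This is an illustration,
# not the PySpark implementation. The placeholder timestamp is a
# hypothetical choice and does not come from the original post.
DEFAULTS = {"price_current": 0.0, "created_at": "1970-01-01 00:00:00"}


def fill_defaults(rows, defaults):
    """Return new row dicts with None replaced by the per-column default."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not mutated
        for column, default in defaults.items():
            # Only existing columns whose value is None are touched.
            if column in new_row and new_row[column] is None:
                new_row[column] = default
        filled.append(new_row)
    return filled


sample_rows = [
    {"id": "a", "price_current": 9.5, "created_at": "2019-02-01 10:00:00"},
    {"id": "b", "price_current": None, "created_at": None},
]
filled_rows = fill_defaults(sample_rows, DEFAULTS)
```

Whether a placeholder value is acceptable depends on how the downstream table is used; a sentinel date can silently distort later queries.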

If you want to drop the rows instead, use the following in place of df.na.fill:

df = df.na.drop(subset=["price_current", "created_at"])
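One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis of the error in the question: a null-only drop removes rows whose value is null, but a blank string in created_at is not null and would survive it. The plain-Python sketch below models a stricter filter that treats both cases as missing. The names `is_missing` and `drop_incomplete` and the sample rows are hypothetical.

```python
# Plain-Python model of dropping rows with missing required values.
# This is an illustration, not Glue or Spark code. Unlike a null-only drop,
# it also treats blank strings as missing.
REQUIRED_COLUMNS = ("price_current", "created_at")


def is_missing(value):
    """Return True for None and for strings that are empty or whitespace."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def drop_incomplete(rows, required):
    """Keep only rows where every required column has a usable value."""
    return [
        row for row in rows
        if not any(is_missing(row.get(column)) for column in required)
    ]


sample_rows = [
    {"id": "a", "price_current": 9.5, "created_at": "2019-02-01 10:00:00"},
    {"id": "b", "price_current": None, "created_at": "2019-02-01 10:00:00"},
    {"id": "c", "price_current": 3.0, "created_at": ""},
    {"id": "d", "price_current": 4.0, "created_at": None},
]
kept_rows = drop_incomplete(sample_rows, REQUIRED_COLUMNS)
```

In a real job the same idea would be expressed as a DataFrame filter on the required columns before converting back to a DynamicFrame.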