The following code loads data from S3, cleans it and removes duplicates with Spark SQL, and then saves the result to Redshift over JDBC. I have also tried the spark-redshift Maven dependency and got the same result. I am using Spark 2.0.
What I can't understand is this: when I show the result that is loaded into memory, the sum is the expected number, but whenever Spark saves to Redshift the total is lower. Somehow not all of the records are being saved, and I don't see any errors in STL_LOAD_ERRORS either. Has anyone run into this, or have any idea why it happens?
import java.util.{Properties, UUID}

// Load files that were loaded into firehose on this day
var s3Files = spark.sqlContext.read.schema(schema)
  .json("s3://" + job.getAWSAccessKey + ":" + job.getAWSSecretKey + "@" + job.getBucketName + "/" +
    job.getAWSS3RawFileExpression + "/" + year + "/" + monthCheck + "/" + dayCheck + "/*/")
  .rdd
// Apply the schema to the RDD, here we will have duplicates
val usersDataFrame = spark.createDataFrame(s3Files, schema)
usersDataFrame.createOrReplaceTempView("results")
// Clean and use partition by the keys to eliminate duplicates and get latest record
var results = spark.sql(buildCleaningQuery(job,"results"))
results.createOrReplaceTempView("filteredResults")
// This returns the correct result!
var check = spark.sql("select sum(Reward) from filteredResults where period=1706")
check.show()
var path = UUID.randomUUID().toString()
println("s3://" + job.getAWSAccessKey + ":" + job.getAWSSecretKey + "@" + job.getAWSS3TemporaryDirectory + "/" + path)
val prop = new Properties()
results.write.jdbc(job.getRedshiftJDBC,"work.\"" + path + "\"",prop)
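For context, the SQL generated by buildCleaningQuery(job, "results") is roughly of this shape: partition by the record keys and keep only the latest row per key. This is only a sketch; the column names (eventKey, eventTime) are placeholders, not the real ones used by the job.

val cleaningQuerySketch =
  """SELECT * FROM (
    |  SELECT *,
    |         ROW_NUMBER() OVER (PARTITION BY eventKey ORDER BY eventTime DESC) AS rn
    |  FROM results
    |) ranked
    |WHERE rn = 1""".stripMargin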
Answer 0 (score: 0)
Using jdbc means Spark will try to execute repeated INSERT INTO statements, which is extremely slow in Redshift. It also explains why you see no entries in stl_load_errors: that table only records errors from COPY loads, and a plain JDBC write never issues a COPY.
I suggest you switch to the spark-redshift library instead. It is well tested and performs much better: https://github.com/databricks/spark-redshift
Example (showing many of the available options):
my_dataframe.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://my_cluster.qwertyuiop.eu-west-1.redshift.amazonaws.com:5439/my_database?user=my_user&password=my_password")
.option("dbtable", "my_table")
.option("tempdir", "s3://my-bucket")
.option("diststyle", "KEY")
.option("distkey", "dist_key")
.option("sortkeyspec", "COMPOUND SORTKEY(key_1, key_2)")
.option("extracopyoptions", "TRUNCATECOLUMNS COMPUPDATE OFF STATUPDATE OFF")
.mode("overwrite") // "append" / "error"
.save()
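One way to confirm that nothing is being dropped is to compare the DataFrame's own count with what Redshift reports after the load. A minimal sketch, reusing the placeholder connection details from the example above:

// Count the rows Spark is about to write...
val expected = my_dataframe.count()

// ...and read the count back from Redshift after the save completes.
// The url/tempdir/table values are the same placeholders as in the example above.
val actual = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://my_cluster.qwertyuiop.eu-west-1.redshift.amazonaws.com:5439/my_database?user=my_user&password=my_password")
  .option("tempdir", "s3://my-bucket")
  .option("query", "select count(*) from my_table")
  .load()
  .first().getLong(0)

println(s"expected=$expected, loaded=$actual")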