Running spark-redshift from IntelliJ with Gradle

Date: 2015-11-04 17:02:24

Tags: apache-spark apache-spark-sql

I am trying to use the spark-redshift library and cannot operate on the DataFrame created by the sqlContext.read() call (which reads from Redshift).

Here is my code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._

// Register the Redshift JDBC driver
Class.forName("com.amazon.redshift.jdbc41.Driver")

val conf = new SparkConf().setAppName("Spark Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "****")

sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "****") 

// Read the Redshift table through spark-redshift, staging data in S3
val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://URL")
  .option("dbtable", "table")
  .option("tempdir", "s3n://bucket/folder")
  .load()

// Register the DataFrame as a temp table and query it with Spark SQL
df.registerTempTable("table")
val data = sqlContext.sql("SELECT * FROM table")

data.show()    

This is the error I get when I run the code above from the main method of a Scala object:

Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1096)
at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:116)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$3.apply(DataSourceStrategy.scala:53)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$3.apply(DataSourceStrategy.scala:53)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:279)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:278)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:310)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:274)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:49)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:374)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:926)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:924)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:930)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:930)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1314)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1377)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:178)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:401)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:362)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:370)
at com.triplelift.spark.Main$.main(Main.scala:37)
at com.triplelift.spark.Main.main(Main.scala)

In case it helps, here are my Gradle dependencies as well:

dependencies {
    compile (
        'com.amazonaws:aws-java-sdk:1.10.31',
        'com.amazonaws:aws-java-sdk-redshift:1.10.31',
        'org.apache.spark:spark-core_2.10:1.5.1',
        'org.apache.spark:spark-streaming_2.10:1.5.1',
        'org.apache.spark:spark-mllib_2.10:1.5.1',
        'org.apache.spark:spark-sql_2.10:1.5.1',
        'com.databricks:spark-redshift_2.10:0.5.2',
        'com.fasterxml.jackson.core:jackson-databind:2.6.3'
    )

    testCompile group: 'junit', name: 'junit', version: '4.11'
}

Needless to say, the error occurs when data.show() is evaluated (the read is lazy, so nothing actually executes until that action).

On an unrelated note... does anyone using IntelliJ 14 know how to permanently add the Redshift driver to a module? Every time I do a Gradle refresh it gets removed from the dependencies in Project Structure. Weird.

1 Answer:

Answer 0 (score: 2):

The original problem was that I was getting this error:

com.fasterxml.jackson.databind.JsonMappingException: 
Could not find creator property with name 'id' (in class org.apache.spark.rdd.RDDOperationScope)

So I followed the answer given here:

Spark Parallelize? (Could not find creator property with name 'id')

So I added the line 'com.fasterxml.jackson.core:jackson-databind:2.6.3' and switched between different versions (e.g. 2.4.4), and then started looking through my external libraries in the project view... I removed the new jackson-databind dependency and looked at all of the Jackson libraries that Spark pulls in, and noticed that they were all at 2.5.1 except for jackson-module-scala_2.10, which was at 2.4.4. So instead of the jackson-databind dependency, I added this:

compile 'com.fasterxml.jackson.module:jackson-module-scala_2.10:2.6.3'

Now my code works. It seems like spark-core 1.5.1 wasn't built correctly before being published to Maven? Not sure.
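An alternative that should achieve the same version alignment (just a sketch; the forced versions below simply mirror the 2.6.3 used in the fix above) is Gradle's resolutionStrategy, which pins an artifact's version across the whole dependency graph:

configurations.all {
    resolutionStrategy {
        // Pin the Jackson Scala module (and databind) to one consistent version
        force 'com.fasterxml.jackson.module:jackson-module-scala_2.10:2.6.3',
              'com.fasterxml.jackson.core:jackson-databind:2.6.3'
    }
}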

Note: be sure to check your transitive dependencies and their versions...
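A quick way to inspect them from the command line (assuming the Gradle wrapper is present; otherwise use plain gradle) is:

./gradlew dependencies --configuration compile
./gradlew dependencyInsight --dependency jackson-databind --configuration compile

The first command prints the full resolved dependency tree for the compile configuration; the second zooms in on a single artifact (here jackson-databind) and shows which versions were requested and which one Gradle actually selected.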