I am aggregating data that lives in AWS S3. The job is written in Scala; I build a standalone fat jar with sbt assembly and run it with spark-submit on an EMR cluster. However, spark-submit fails. What am I missing here?
My project structure:
/Users/itru/IdeaProjects/mobilewalla/build.sbt
/Users/itru/IdeaProjects/mobilewalla/src/main/scala/metrics/India.scala
/Users/itru/IdeaProjects/mobilewalla/project/assembly.sbt
My Scala code in IntelliJ:
package metrics

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.types._

import scala.io.StdIn.readLine

object India {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("IndiaMetrics")
    val sc = new SparkContext(sparkConf)

    println("Please provide input path")
    val input = readLine("prompt> ")

    // Load the raw records (delimited by "|^") and cache them for reuse.
    val DataRDD = sc.textFile(input).cache()

    val schema = StructType(Array(
      StructField("device_id", StringType, true),
      StructField("device_id_type", StringType, true),
      StructField("Useragent", StringType, true),
      StructField("ip_address", StringType, true),
      StructField("event_date", StringType, true),
      StructField("Lat", StringType, true),
      StructField("Long", StringType, true),
      StructField("country_cde", StringType, true)))

    // Split each line on the "|^" delimiter and wrap the fields in a Row.
    val rowRDD = DataRDD.map(_.split("\\|\\^"))
      .map(e => Row(e(0), e(1), e(2), e(3), e(4), e(5), e(6), e(7)))

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // createDataFrame replaces applySchema, which is deprecated since Spark 1.3.
    val mobilewallaDF = sqlContext.createDataFrame(rowRDD, schema)
    mobilewallaDF.registerTempTable("mobilewalla")

    val DistDeviceIds = sqlContext.sql("select count(DISTINCT device_id) from mobilewalla")
    DistDeviceIds.show()

    // Distinct location events per device, most active devices first.
    val locEvePerDeviceID = sqlContext.sql("select device_id, count(DISTINCT Lat) as LocationEvents from mobilewalla group by device_id order by LocationEvents desc")
    locEvePerDeviceID.registerTempTable("LocationEventsCount")

    // Frequency distribution: how many devices share each LocationEvents count.
    val FreqCount = sqlContext.sql("select LocationEvents, count(LocationEvents) as somanydeviceswithLocationEvents from LocationEventsCount group by LocationEvents order by somanydeviceswithLocationEvents desc")

    println("Please provide output path")
    val output = readLine("prompt> ")

    // Coalesce to one partition so the result lands in a single CSV file.
    FreqCount.coalesce(1)
      .write.format("com.databricks.spark.csv")
      .option("header", "true")
      .save(output)
  }
}
The code works perfectly fine when I run it locally.
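(For comparison, the stdin prompting could be replaced by plain program arguments passed after the jar on the spark-submit line; a minimal sketch, with the object name IndiaArgs purely hypothetical and the aggregation body elided:)

package metrics

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object IndiaArgs {
  def main(args: Array[String]) {
    // Expected invocation: spark-submit ... mobilewalla_metrics1.jar <inputPath> <outputPath>
    require(args.length == 2, "usage: IndiaArgs <inputPath> <outputPath>")
    val input = args(0)
    val output = args(1)
    val sc = new SparkContext(new SparkConf().setAppName("IndiaMetrics"))
    // ... same aggregation as in India.main, reading from input, writing to output ...
  }
}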
Here is my build.sbt file:
name := "mobilewalla"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" % "2.0.0",
"org.apache.spark" %% "spark-sql" % "2.0.0")
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
assemblyJarName in assembly := "mobilewalla_metrics1.jar"
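(As an aside, since EMR already provides Spark on the cluster classpath, I believe the usual setup is to mark the Spark dependencies as provided so they are not bundled into the fat jar; a sketch of that variant:)

// Assumption: Spark 2.0.0 is supplied by the EMR runtime, so the provided
// scope keeps it out of the assembly jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided")

(Note that with provided scope, plain sbt run no longer sees the Spark classes on its classpath.)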
My assembly.sbt file:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.2")
addSbtPlugin("io.spray" % "sbt-revolver" % "0.7.2")
So far, so good: sbt compile, sbt run, and sbt assembly all work fine and give me a fat jar.
I copy this jar to the EMR cluster with the scp command.
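(Roughly like this, with the key file and master node address elided; hadoop is the default EMR login user:)

scp -i <my-key.pem> mobilewalla_metrics1.jar hadoop@<emr-master-dns>:~/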
When I spark-submit the job, it says:
>spark-submit --class metrics.India --master yarn-client --num-executors 5
--driver-memory 4g --executor-memory 8g --executor-cores 5 --jars mobilewalla_metrics1.jar
Error: Must specify a primary resource (JAR or Python or R file)
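For reference, as I understand it spark-submit expects the application jar itself as a positional argument after the options (the "primary resource" the error refers to), with --jars meant only for additional dependency jars:

spark-submit --class <main-class> --master <master-url> [options] <application-jar> [application-arguments]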
I cannot figure out where the problem is. What am I missing, and what might the issue be?