scala.Function0 not found when running a simple Spark WordCount in Scala

Date: 2017-10-26 17:36:52

Tags: scala apache-spark sbt bigdata data-science

I'm trying to run a simple word-count program with Spark in Scala. I did the whole installation myself on Linux, but I can't run the program because I get this error:

java.lang.ClassNotFoundException: scala.Function0
at sbt.internal.inc.classpath.ClasspathFilter.loadClass(ClassLoaders.scala:74)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at com.twitter.chill.KryoBase$$anonfun$1.apply(KryoBase.scala:41)
at com.twitter.chill.KryoBase$$anonfun$1.apply(KryoBase.scala:41)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.twitter.chill.KryoBase.<init>(KryoBase.scala:41)
at com.twitter.chill.EmptyScalaKryoInstantiator.newKryo(ScalaKryoInstantiator.scala:57)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:96)
at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292)
at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:277)
at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186)
at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects$lzycompute(KryoSerializer.scala:193)
at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects(KryoSerializer.scala:189)
at org.apache.spark.shuffle.sort.SortShuffleManager$.canUseSerializedShuffle(SortShuffleManager.scala:187)
at org.apache.spark.shuffle.sort.SortShuffleManager.registerShuffle(SortShuffleManager.scala:99)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:90)
at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:87)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.dependencies(RDD.scala:237)
at org.apache.spark.scheduler.DAGScheduler.getShuffleDependencies(DAGScheduler.scala:431)
at org.apache.spark.scheduler.DAGScheduler.getOrCreateParentStages(DAGScheduler.scala:380)
at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:367)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:850)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1677)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

The code I'm trying to execute is basically:

import org.apache.spark.sql.SparkSession

object SparkWordCount extends App {

  // Create a local SparkSession that uses all available cores
  val spark = SparkSession.builder
    .master("local[*]")
    .appName("Spark Word Count")
    .getOrCreate()

  // Parallelize a small in-memory collection of lines into an RDD
  val lines = spark.sparkContext.parallelize(
    Seq("Spark Intellij Idea Scala test one",
      "Spark Intellij Idea Scala test two",
      "Spark Intellij Idea Scala test three"))

  // Split each line into words, pair each word with 1, and sum the counts per word
  val counts = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

  counts.foreach(println)
}

I think it must be something related to the Spark and Scala versions, but I can't find a good solution.

My build.sbt looks like this:

name := "com.example.test"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql" % "2.2.0"
)

If I execute the same code in the spark-shell it works fine. I have the following versions installed:

  • Scala: 2.11.8
  • Spark: 2.2.0
  • sbt: 1.0.2

I've also tried different Scala and Spark versions and everything else I could think of, but nothing works. Can anyone help me?

Thanks in advance, Javi.

  • When I run the sbt command inside the project, it prints:

    Getting Scala 2.12.3 (for sbt)...

I can't understand this, because I specified the Scala version in build.sbt (scalaVersion := "2.11.8").

1 Answer:

Answer 0 (score: 1)

I'll assume you are trying to run this code from IntelliJ IDEA. If that's the case, the main problem lies in your IDEA configuration.

Go to File > Project Structure > Global Libraries and check whether scala-library is installed there. IDEA uses its own sources for libraries and the SDK, which is why everything runs fine in the shell.
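As a side note, on the sbt side the Scala standard library is added automatically based on scalaVersion, so no extra dependency is needed there. If you want to see it spelled out anyway, this purely illustrative line is equivalent to the default:

libraryDependencies += "org.scala-lang" % "scala-library" % scalaVersion.value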

I tried your code in a pure sbt layout and it works, with exactly the same build.sbt and source file.
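For reference, by a pure sbt layout I mean something like the following (the file name is just an example; any path under src/main/scala works). From the project root, sbt run compiles and starts the application:

.
├── build.sbt
└── src/
    └── main/
        └── scala/
            └── SparkWordCount.scala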

One small thing: maybe IDEA stops all running Spark sessions automatically at the end of a run, but I believe you have to stop the active session explicitly. Just put spark.stop() at the end of your executable function or object.
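A minimal sketch of what that looks like, reusing the object from the question (only the comments and the final spark.stop() are new):

import org.apache.spark.sql.SparkSession

object SparkWordCount extends App {
  val spark = SparkSession.builder
    .master("local[*]")
    .appName("Spark Word Count")
    .getOrCreate()

  // ... build and print the word counts exactly as in the question ...

  // Explicitly release the session and its resources once the job is done
  spark.stop()
}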