Question

I have set up the Spark core project from https://github.com/apache/spark.git. I ran one of the test classes, CacheManagerSuite, and it passed.

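How can I run some Spark transformations/actions against the source? Which class/object in the Spark project source do I need to invoke in order to run the following?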

scala> val x = sc.parallelize(List(List("a"), List("b"), List("c", "d")))
x: org.apache.spark.rdd.RDD[List[String]] = ParallelCollectionRDD[1] at parallelize at <console>:12

scala> x.collect()
res0: Array[List[String]] = Array(List(a), List(b), List(c, d))

scala> x.flatMap(y => y)
res3: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[3] at flatMap at <console>:15

Answer 1

The Spark core project contains unit tests that make it clearer how the parallelize and reduce methods are called and implemented.

In org.apache.spark.util.ClosureCleanerSuite, TestClassWithoutDefaultConstructor can be invoked.

org.apache.spark.util.TestClassWithoutDefaultConstructor calls Spark's parallelize and reduce methods:

class TestClassWithoutDefaultConstructor(x: Int) extends Serializable {
  def getX = x

  def run(): Int = {
    var nonSer = new NonSerializable
    withSpark(new SparkContext("local", "test")) { sc =>
      val nums = sc.parallelize(Array(1, 2, 3, 4))
      nums.map(_ + getX).reduce(_ + _)
    }
  }
}

Similarly, org.apache.spark.rdd.PairRDDFunctionsSuite contains method calls to groupByKey.
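
As a rough illustration of the kind of groupByKey call such a suite exercises, here is a minimal sketch. It is not taken from PairRDDFunctionsSuite: the object name GroupByKeySketch, the helper groupLocally, and the sample pairs are made up for illustration, and it assumes spark-core is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Brings the pair-RDD implicits into scope on older Spark releases;
// on newer releases it is not needed but still compiles.
import org.apache.spark.SparkContext._

// Hypothetical object, for illustration only.
object GroupByKeySketch {
  // Groups the values of a small pair RDD by key and returns a plain Map,
  // using Set for the values so the result does not depend on ordering.
  def groupLocally(sc: SparkContext): Map[Int, Set[Int]] = {
    val pairs = sc.parallelize(Seq((1, 1), (1, 2), (1, 3), (2, 1)))
    pairs.groupByKey().collect().map { case (k, vs) => k -> vs.toSet }.toMap
  }

  def main(args: Array[String]): Unit = {
    // "local" runs Spark inside this JVM, as the test classes above do.
    val conf = new SparkConf().setMaster("local").setAppName("group-by-key-sketch")
    val sc = new SparkContext(conf)
    try println(groupLocally(sc)) finally sc.stop()
  }
}
```

Returning a Map of Sets is only a convenience for inspecting the grouped result; the element type that groupByKey yields for the values differs between Spark versions.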

The tests above compile and run on the local machine.

Answer 2

To try out Spark as in the quoted example, start bin/spark-shell.
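
If a compiled entry point is preferred over the shell, one option is a small driver object with a main method. The following is a minimal sketch under assumptions: the object name FlatMapSketch and the helper flatten are hypothetical, spark-core is assumed to be on the classpath, and the master is set to local so that nothing beyond the local machine is needed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver object, for illustration only.
object FlatMapSketch {
  // Runs the same parallelize / flatMap / collect steps as the shell
  // session in the question and returns the flattened elements.
  def flatten(sc: SparkContext): Array[String] = {
    val x = sc.parallelize(List(List("a"), List("b"), List("c", "d")))
    x.flatMap(y => y).collect()
  }

  def main(args: Array[String]): Unit = {
    // In spark-shell the context is pre-built as `sc`; here it is created by hand.
    val conf = new SparkConf().setMaster("local").setAppName("flatmap-sketch")
    val sc = new SparkContext(conf)
    try println(flatten(sc).mkString(", ")) finally sc.stop()
  }
}
```

Running main from an IDE, or from a module that depends on the locally built spark-core, should execute the same flatMap as in the question's shell session, which makes it possible to set breakpoints inside the Spark source.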
