Considerations when running code on multiple nodes in Apache Spark

Posted: 2014-12-03 14:44:01

Tags: scala apache-spark

The following code reads a file (example.txt) and computes the Euclidean distance between every pair of points. The contents of example.txt (quoted below) are:

a,1
b,1
c,2

This code works as expected, but for large datasets it is very slow. Aside from filtering out redundant comparisons, e.g. (a,b) and (b,a), where the (b,a) comparison is a duplicate, what should I be aware of? Currently I am only running this code on a single node, but what considerations should I take into account in order to run it on multiple nodes?
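Since Euclidean distance is symmetric, one way to cut the pair count roughly in half is to keep only pairs whose names are in ascending order, which also drops the self-comparisons (a,a). A minimal sketch with plain Scala collections, reusing the `User` shape from the question (the Spark version would apply the same predicate as a `filter` after `cartesian`):

```scala
case class User(name: String, features: Vector[Double])

def euclDistance(a: User, b: User): Double =
  math.sqrt((a.features zip b.features).map { case (x, y) => (x - y) * (x - y) }.sum)

val users = List(
  User("a", Vector(1.0)),
  User("b", Vector(1.0)),
  User("c", Vector(2.0))
)

// Cross product, keeping each unordered pair exactly once:
// (a,b) survives, while (b,a) and (a,a) are filtered out.
val pairs = for {
  u <- users
  v <- users
  if u.name < v.name
} yield ((u.name, v.name), euclDistance(u, v))
```

On the RDD this would be `users.cartesian(users).filter { case (u, v) => u.name < v.name }` before the distance map, reducing n² comparisons to n·(n-1)/2.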

import org.apache.spark.SparkContext;

object first {
  println("Welcome to the Scala worksheet")

  val conf = new org.apache.spark.SparkConf()
    .setMaster("local")
    .setAppName("distances")
    .setSparkHome("C:\\spark-1.1.0-bin-hadoop2.4\\spark-1.1.0-bin-hadoop2.4")
    .set("spark.executor.memory", "2g")
  val sc = new SparkContext(conf)

  def euclDistance(userA: User, userB: User) = {

    val subElements = (userA.features zip userB.features) map {
      m => (m._1 - m._2) * (m._1 - m._2)
    }
    val summed = subElements.sum
    val sqRoot = Math.sqrt(summed)

    println("value is " + sqRoot)
    ((userA.name, userB.name), sqRoot)
  }

  case class User(name: String, features: Vector[Double])

  def createUser(data: String) = {

    val id = data.split(",")(0)
    val splitLine = data.split(",")

    val distanceVector = (splitLine.toList match {
      case h :: t => t
    }).map(m => m.toDouble).toVector

    User(id, distanceVector)

  }

  val dataFile = sc.textFile("c:\\data\\example.txt")
  val users = dataFile.map(m => createUser(m))
  val cart = users.cartesian(users) //
  val distances = cart.map(m => euclDistance(m._1, m._2))
  //> distances  : org.apache.spark.rdd.RDD[((String, String), Double)] = MappedR
  //| DD[4] at map at first.scala:46
  val d = distances.collect //

  d.foreach(println) //> ((a,a),0.0)
  //| ((a,b),0.0)
  //| ((a,c),1.0)
  //| ((a,),0.0)
  //| ((b,a),0.0)
  //| ((b,b),0.0)
  //| ((b,c),1.0)
  //| ((b,),0.0)
  //| ((c,a),1.0)
  //| ((c,b),1.0)
  //| ((c,c),0.0)
  //| ((c,),0.0)
  //| ((,a),0.0)
  //| ((,b),0.0)
  //| ((,c),0.0)
  //| ((,),0.0)

}
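Note that the pattern match in `createUser` is not exhaustive (an empty list would throw a `MatchError`), and blank input lines produce users with empty names, which is where the ((a,),0.0)-style rows in the output above come from. A defensive variant, as a sketch (the `Option` return type and `flatMap` usage are my suggestion, not part of the original code):

```scala
case class User(name: String, features: Vector[Double])

// Returns None for blank or malformed lines instead of emitting an empty-named user.
def createUser(line: String): Option[User] = {
  val fields = line.split(",").map(_.trim).toList
  fields match {
    case name :: rest if name.nonEmpty && rest.nonEmpty =>
      // Try catches NumberFormatException from toDouble on non-numeric fields.
      scala.util.Try(User(name, rest.map(_.toDouble).toVector)).toOption
    case _ => None
  }
}
```

On the RDD, `dataFile.flatMap(createUser)` would then silently drop bad lines rather than crashing the job or polluting the results.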

1 Answer:

Answer 0 (score: 0)

Spark on multiple nodes should run faster without requiring any code changes. Beyond that, you can tune it to run faster just as you would any other software system.

For now, you can make the local run faster simply by giving it more cores.

Change the following:

.setMaster("local")

to:

.setMaster("local[4]") // 4, 8, or 16, depending on how many cores your local machine has
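Rather than hard-coding the thread count, it can be derived from the JVM (a small sketch; Spark also accepts the shorthand master URL "local[*]", which uses all available cores):

```scala
// Use as many local worker threads as the machine has cores.
val cores = Runtime.getRuntime.availableProcessors()
val master = s"local[$cores]"

// e.g. new org.apache.spark.SparkConf().setMaster(master)
println(master)
```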