Spark - How to use SparkContext inside a class?

Asked: 2015-07-27 23:08:18

Tags: java scala apache-spark

I am building an application in Spark and would like to use the SparkContext and/or SQLContext inside methods of my classes, mostly to pull or generate data sets from files or SQL queries.

For example, I'd like to create a T2P object that contains methods that gather data (and in this case needs access to the SparkContext):

import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext}

class T2P(mid: Int, sc: SparkContext, sqlContext: SQLContext) extends Serializable {

  import sqlContext.implicits._ // required for .toDF()

  // Data is a case class defined elsewhere in the project
  def getImps(): DataFrame = {
    sc.textFile("file.txt")
      .map(line => line.split("\t"))
      .map(d => Data(d(0).toInt, d(1), d(2), d(3)))
      .toDF()
  }

  def getX(): DataFrame = {
    sqlContext.sql("SELECT a,b,c FROM table")
  }
}

// creating the T2P object
class App {
  val conf = new SparkConf().setAppName("T2P App").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val t2p = new T2P(0, sc, sqlContext)
}

Passing the SparkContext into the T2P class as a parameter doesn't work, since SparkContext is not serializable (I get a 'task not serializable' error when creating the T2P object). What is the best way to use a SparkContext/SQLContext inside my classes? Or is this perhaps the wrong way to design a data-pulling process in Spark?

UPDATE: As the comments on this post revealed, the SparkContext itself wasn't the problem. I was calling a method of my class inside a map function, which caused Spark to try to serialize the entire class; that is what produced the error, because SparkContext is not serializable.

def startMetricTo(userData: ((Int, String), List[(Int, String)]), startMetric: String): T2PUser = {
  //do something
}

def buildUserRollup() = {
  this.userRollup = this.userSorted.map(line => startMetricTo(line, this.startMetric))
}

This results in a 'Task not serializable' exception.
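To make the failure mode concrete, here is a minimal sketch (hypothetical names, not the original code) of why calling an instance method inside map fails, assuming a class that holds a SparkContext:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class ClosureDemo(val sc: SparkContext) {
  val suffix = "!"

  def tag(s: String): String = s + suffix

  // Calling an instance method inside the closure drags `this` (and the
  // SparkContext field) into serialization -> Task not serializable.
  def broken(rdd: RDD[String]): RDD[String] =
    rdd.map(line => tag(line))

  // Copying what the closure needs into a local val keeps `this` out of it.
  def works(rdd: RDD[String]): RDD[String] = {
    val localSuffix = suffix
    rdd.map(line => line + localSuffix)
  }
}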

2 answers:

Answer 0 (score: 1):

I fixed this problem (with the help of the commenters and other StackOverflow users) by creating a separate MetricCalc object to store my startMetricTo() method. Then I changed the buildUserRollup() method to use this new startMetricTo(). This allows the entire MetricCalc object to be serialized without issue.

//newly created object
object MetricCalc {
  def startMetricTo(userData: ((Int, String), List[(Int, String)]), startMetric: String): T2PUser = {
    //do something
  }
}

//using the function in T2P
def buildUserRollup(startMetric: String) = {
  this.userRollup = this.userSorted.map(line => MetricCalc.startMetricTo(line, startMetric))
}
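This works because methods on a top-level Scala object behave like static methods: the closure passed to map now references MetricCalc directly instead of capturing the enclosing class instance, so Spark no longer tries to serialize the class holding the SparkContext. Passing startMetric as an ordinary parameter likewise keeps the enclosing instance out of the closure.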

Answer 1 (score: 0):

I tried several options, and this is what finally worked for me:

object SomeName extends App {

  val conf = new SparkConf()...
  val sc = new SparkContext(conf)

  implicit val sqlC = SQLContext.getOrCreate(sc)
  getDF1(sqlC)

  def getDF1(sqlCo: SQLContext): Unit = {
    val query1 = "SomeQuery here"
    val df1 = sqlCo.read.format("jdbc").options(Map("url" -> dbUrl, "dbtable" -> query1)).load.cache()

    // iterate through df1 and retrieve the 2nd DataFrame based on some values in the Row of the first DataFrame
    df1.foreach(x => {
      getDF2(x.getString(0), x.getDecimal(1).toString, x.getDecimal(3).doubleValue)(sqlCo)
    })
  }

  def getDF2(a: String, b: String, c: Double)(implicit sqlCont: SQLContext): Unit = {
    val query2 = "Somequery"

    val sqlcc = SQLContext.getOrCreate(sc)
    //val sqlcc = sqlCont // Did not work for me. Also, omitting (implicit sqlCont: SQLContext) altogether did not work
    val df2 = sqlcc.read.format("jdbc").options(Map("url" -> dbURL, "dbtable" -> query2)).load().cache()
    // ...
  }
}

Note: In the code above, if I omitted the (implicit sqlCont: SQLContext) parameter from the getDF2 method signature, it would not work. I tried several other options for passing the sqlContext from one method to another, and it always gave me a NullPointerException or a 'Task not serializable' exception. In the end it worked this way, and I could retrieve parameters from a row of DataFrame1 and use them when loading DataFrame 2.
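For context, here is a minimal sketch (hypothetical names, not the answerer's code) of what the implicit parameter buys you: when a method declares an implicit parameter list, the Scala compiler fills it in from a matching implicit val that is in scope at the call site:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object ImplicitSketch {
  // sqlContext is supplied automatically by the compiler at the call site
  def getDF2(query: String)(implicit sqlContext: SQLContext): Unit = {
    // ... use sqlContext.read here ...
  }

  def run(sc: SparkContext): Unit = {
    implicit val sqlC: SQLContext = SQLContext.getOrCreate(sc)
    getDF2("SELECT 1") // the compiler expands this to getDF2("SELECT 1")(sqlC)
  }
}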