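I'm building an application in Spark and want to use the SparkContext and/or SQLContext inside methods of my classes, mostly to pull or generate datasets from files or SQL queries.
For example, I want to create a T2P object whose methods gather data (and therefore need access to the SparkContext):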
class T2P(mid: Int, sc: SparkContext, sqlContext: SQLContext) extends Serializable {
  import sqlContext.implicits._ // needed for .toDF()

  // Data is a case class defined elsewhere (not shown)
  def getImps(): DataFrame =
    sc.textFile("file.txt")
      .map(line => line.split("\t"))
      .map(d => Data(d(0).toInt, d(1), d(2), d(3)))
      .toDF()

  def getX(): DataFrame =
    sqlContext.sql("SELECT a,b,c FROM table")
}
// Creating the T2P object
class App {
  val conf = new SparkConf().setAppName("T2P App").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val t2p = new T2P(0, sc, sqlContext)
}
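Passing the SparkContext as a parameter to the T2P class doesn't work, because SparkContext is not serializable (I get a "Task not serializable" error when creating T2P objects). What is the best way to use the SparkContext / SQLContext inside my classes? Or is this perhaps the wrong way to design a data-pull type of process in Spark?
UPDATE: From the comments on this post I realized that passing the SparkContext was not the problem in itself. I was calling a method of the class inside a map function, which caused Spark to try to serialize the entire class. That fails because the class holds a SparkContext, which is not serializable.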
def startMetricTo(userData: ((Int, String), List[(Int, String)]), startMetric: String): T2PUser = {
  // do something
}

def buildUserRollup() = {
  // startMetricTo is an instance method, so this closure captures `this`
  this.userRollup = this.userSorted.map(line => startMetricTo(line, this.startMetric))
}
This results in a "Task not serializable" exception.
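The capture described in the update can be reproduced without Spark. The following is a minimal sketch, assuming Scala 2.12 or later; FakeContext, Holder, and isSerializable are hypothetical names invented for illustration, and plain Java serialization stands in for what Spark does to a closure before shipping it to executors.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for SparkContext: a class that is NOT Serializable.
class FakeContext

// Mirrors the T2P layout: Serializable itself, but holding a non-serializable field.
class Holder(val ctx: FakeContext, val startMetric: String) extends Serializable {
  def startMetricTo(line: String, metric: String): String = s"$line:$metric"

  // Calls an instance method and reads a field, so the lambda captures `this`.
  def capturingClosure: String => String =
    line => startMetricTo(line, this.startMetric)
}

object ClosureCaptureDemo {
  // Java-serializes the object into a throwaway buffer and reports whether it succeeded.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val holder = new Holder(new FakeContext, "m1")
    // Serializing the closure drags in the whole Holder, including its FakeContext.
    println(isSerializable(holder.capturingClosure))
  }
}
```

The closure itself is small, but because it refers to `this`, serializing it means serializing every field of the enclosing instance, which is where the non-serializable context gets pulled in.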
Answer 0 (score: 1)
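I fixed this problem (with the help of the commenters and other StackOverflow users) by creating a separate MetricCalc object to hold my startMetricTo() method. Then I changed the buildUserRollup() method to call this new MetricCalc.startMetricTo() and to take startMetric as a parameter. The closure passed to map now refers only to the standalone MetricCalc object and a local value instead of `this`, so Spark no longer tries to serialize the enclosing class along with its SparkContext.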
// Newly created object
object MetricCalc {
  def startMetricTo(userData: ((Int, String), List[(Int, String)]), startMetric: String): T2PUser = {
    // do something
  }
}

// Using the function in T2P
def buildUserRollup(startMetric: String) = {
  this.userRollup = this.userSorted.map(line => MetricCalc.startMetricTo(line, startMetric))
}
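The effect of moving the method into a standalone object can be checked without Spark. This is a minimal sketch, assuming Scala 2.12 or later; NonSerializableContext, MetricHelper, Rollup, and canSerialize are hypothetical names for illustration, with MetricHelper playing the role of MetricCalc and Java serialization approximating Spark's closure check.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for SparkContext: deliberately not Serializable.
class NonSerializableContext

// Hypothetical analogue of MetricCalc: a standalone object holding the helper.
object MetricHelper {
  def startMetricTo(line: String, metric: String): String = s"$line:$metric"
}

// Analogue of T2P: still holds the non-serializable context as a field.
class Rollup(val ctx: NonSerializableContext) extends Serializable {
  // The lambda uses only the object method and the parameter; `this` is not referenced.
  def buildClosure(startMetric: String): String => String =
    line => MetricHelper.startMetricTo(line, startMetric)
}

object ObjectHelperDemo {
  // Java-serializes the object into a throwaway buffer and reports whether it succeeded.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val rollup = new Rollup(new NonSerializableContext)
    println(canSerialize(rollup.buildClosure("m1"))) // the closure alone
    println(canSerialize(rollup))                    // the enclosing instance
  }
}
```

Passing startMetric as a parameter matters as much as moving the method: reading this.startMetric inside the lambda would capture the enclosing instance again.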
Answer 1 (score: 0)
I tried several options, and this is what finally worked for me:
object SomeName extends App {
  val conf = new SparkConf()...
  val sc = new SparkContext(conf)
  implicit val sqlC = SQLContext.getOrCreate(sc)

  getDF1(sqlC)

  // dbUrl is assumed to be defined elsewhere (not shown)
  def getDF1(sqlCo: SQLContext): Unit = {
    val query1 = SomeQuery here
    val df1 = sqlCo.read.format("jdbc").options(Map("url" -> dbUrl, "dbtable" -> query1)).load().cache()

    // Iterate through df1 and retrieve the 2nd DataFrame based on some values in the Row of the first DataFrame
    df1.foreach(x => {
      getDF2(x.getString(0), x.getDecimal(1).toString, x.getDecimal(3).doubleValue)(sqlCo)
    })
  }

  def getDF2(a: String, b: String, c: Double)(implicit sqlCont: SQLContext): Unit = {
    val query2 = Somequery
    val sqlcc = SQLContext.getOrCreate(sc)
    // val sqlcc = sqlCont // Did not work for me. Also, omitting (implicit sqlCont: SQLContext) altogether did not work
    val df2 = sqlcc.read.format("jdbc").options(Map("url" -> dbUrl, "dbtable" -> query2)).load().cache()
    // ...
  }
}
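Note: in the code above, if I omit the (implicit sqlCont: SQLContext) parameter from the getDF2 method signature, it does not work. I tried several other ways of passing the sqlContext from one method to another, and they always gave me either a NullPointerException or a Task not serializable exception. In the end it worked this way: I can retrieve parameters from a row of the first DataFrame and use those values when loading the second DataFrame.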
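One possible reason for the errors mentioned in the note is that DataFrame.foreach runs its function on the executors, where a SQLContext is generally not usable. A commonly used alternative, not taken from this answer, is to collect the driving rows first and build the second DataFrame in a loop on the driver. The sketch below assumes the Spark 1.x-style SQLContext API and a first DataFrame small enough to collect; DriverSideLookup, loadSecond, and loadAll are hypothetical names, and in-memory data replaces the JDBC queries.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object DriverSideLookup {
  // Hypothetical helper: builds a second DataFrame from one value of the first.
  // A real version would run a JDBC query parameterized by `key` here.
  def loadSecond(sqlContext: SQLContext, key: String): DataFrame = {
    import sqlContext.implicits._
    Seq((key, key.length)).toDF("key", "len")
  }

  // collect() brings the rows to the driver, so the map below runs on the
  // driver, where using sqlContext is valid.
  def loadAll(sqlContext: SQLContext, df1: DataFrame): Array[DataFrame] =
    df1.collect().map(row => loadSecond(sqlContext, row.getString(0)))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("driver-side-lookup").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = SQLContext.getOrCreate(sc)
    import sqlContext.implicits._

    // In-memory stand-in for the first JDBC-backed DataFrame.
    val df1 = Seq(("a", 1), ("bb", 2)).toDF("name", "n")
    loadAll(sqlContext, df1).foreach(_.show())

    sc.stop()
  }
}
```

This only suits cases where the first DataFrame is small, since every driving row is pulled into driver memory; for large inputs, expressing the lookup as a join would usually be the better fit.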