Implementing a SparkSQL UDAF for a use case

Date: 2016-06-15 18:16:11

Tags: scala hadoop apache-spark apache-spark-sql spark-dataframe

I am working with custom aggregation in SparkSQL. The use case is as follows:
I have data like this (sample):

+-----------+-------+-------------+--------+
|season_year|line_pn|stock_loc_num|quantity|
+-----------+-------+-------------+--------+
|Autumn-2012|ACD47PS|           22|       2|
|Autumn-2012|ACD47PS|            3|       1|
|Autumn-2012|ACD47PS|           52|       9|
|Autumn-2012|ACD47PS|            9|       1|
|Autumn-2012|ACD47PS|            1|       4|
|Autumn-2012|ACD47PS|            1|       1|
|Autumn-2012|ACD47PS|            1|       1|
|Autumn-2012|ACD47PS|           10|       2|
|Autumn-2012|ACD47PS|           12|       2|
|Autumn-2012|ACD47PS|           15|       2|
|Autumn-2012|ACD47PS|           15|       3|
|Autumn-2012|ACD47PS|           15|       3|
|Autumn-2012|ACD47PS|           16|       1|
|Autumn-2012|ACD47PS|           18|       1|
|Autumn-2012|ACD47PS|           18|       3|
|Autumn-2012|ACD47PS|            2|      49|
|Autumn-2012|ACD47PS|            2|       7|
|Autumn-2012|ACD47PS|           21|       5|
|Autumn-2012|ACD47PS|           22|       8|
|Autumn-2012|ACD47PS|           24|       3|
+-----------+-------+-------------+--------+

Note: there are 250K line_pn values, 70 stock_loc_num values, and season_year ranges from 2009 to Spring 2016.

I am trying to write a SparkSQL UDAF that is grouped by line_pn and stock_loc_num and takes two attributes into the aggregate function: season_year and quantity.

And, again inside the custom aggregate function (for each group of line_pn & stock_loc_num), I want to group by season_year and Sum(quantity).

df.groupBy("line_pn", "stock_loc_num").agg(seasonality(df.col("quantity"), df.col("season_year")).as("seasonality"))

Then, for the aggregated data, I build a time series & run an Exponential Smoothing state-space model on it (I have already implemented that part; it takes the time series as input and tells whether a line_pn at a stock_loc_num is seasonal or non-seasonal).
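
Roughly, the shape I have in mind is the following (just a minimal sketch, not working code: the class name SeasonalitySketch, the SeasonalityHelper.toTimeSeries helper and the 0.5 gamma threshold are placeholders made up for illustration). The idea is to keep a Map of season_year -> Sum(quantity) in the aggregation buffer and build the time series only in evaluate():

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Minimal sketch: accumulate Sum(quantity) per season_year in a MapType buffer,
// then build the time series and train the smoothing model once per group in evaluate().
class SeasonalitySketch extends UserDefinedAggregateFunction {

  def inputSchema: StructType = StructType(
    StructField("quantity", IntegerType) ::
    StructField("season", StringType) :: Nil)

  // Single buffer field: season_year -> running Sum(quantity)
  def bufferSchema: StructType = StructType(
    StructField("sumsBySeason", MapType(StringType, IntegerType)) :: Nil)

  def dataType: DataType = StringType
  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer(0) = Map.empty[String, Int]
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    val sums = buffer.getMap[String, Int](0).toMap
    val qty = input.getInt(0)
    val season = input.getString(1)
    buffer(0) = sums + (season -> (sums.getOrElse(season, 0) + qty))
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    val left = buffer1.getMap[String, Int](0).toMap
    val right = buffer2.getMap[String, Int](0).toMap
    buffer1(0) = right.foldLeft(left) { case (acc, (season, qty)) =>
      acc + (season -> (acc.getOrElse(season, 0) + qty))
    }
  }

  def evaluate(buffer: Row) = {
    val sums = buffer.getMap[String, Int](0).toMap
    // toTimeSeries is a hypothetical helper (sketched after the smoothing code below)
    // that orders the seasons chronologically and fills missing seasons with 0.
    val ts = SeasonalityHelper.toTimeSeries(sums)
    val model = SeasonalExponentialSmoothing.train(ts, 4)
    // Assumed decision rule: a large gamma means the seasonal component matters.
    if (model.bestParams()._3 > 0.5) "A" else "N"
  }
}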

The output must be:


For the grouping at the line_pn, stock_loc_num level, the aggregation must produce the seasonality flag:

+-------+-------------+--------+
|line_pn|stock_loc_num|seasonal|
+-------+-------------+--------+
|ACD47PS|           22|       N|
|ACD47PS|            3|       A|
|MOTFP70|           52|       N|
+-------+-------------+--------+

I have tried many things and I am not able to write the UDAF. Please help.

The code:


Main code:

import org.apache.spark.sql._
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object UDAF {

//Extend UserDefinedAggregateFunction to write a custom aggregate function.
//You can also specify any constructor arguments.

    val conf = new SparkConf().setAppName("HiveQL").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)




class Seasonality extends UserDefinedAggregateFunction {

  // Input Data Type Schema

  def inputSchema: StructType = StructType(Array(StructField("quantity", IntegerType), StructField("season", StringType)))

  // Intermediate Schema
  def bufferSchema: StructType = StructType(
    StructField("sumQty", IntegerType) ::
    StructField("season_year", StringType) :: Nil
  )  
  // Returned Data Type .
  def dataType: DataType = StringType

  // Self-explaining
  def deterministic = true

  // This function is called whenever key changes
  def initialize(buffer: MutableAggregationBuffer) = {
    buffer(0) = 0 // set the running sum of quantity to 0
    buffer(1) = "" // set season_year to blank
  }

  // Iterate over each entry of a group
  def update(buffer: MutableAggregationBuffer, input: Row) = {   

    // Clueless what should be done here? I'm just summing the quantity attribute and pushing a new String into the buffer :\

    buffer(0) = buffer.getInt(0) + input.getInt(0)
    buffer(1) = input.getString(1)

  }

  // Merge two partial aggregates
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    buffer1(0) = buffer1.getInt(0) + buffer2.getInt(0)
    buffer1(1) = buffer2.getString(1)

    //println("Buffer1 Seq: "+buffer1.toSeq)
    //println("Buffer2 Seq: "+buffer2.toSeq)
  }

  // Called after all the entries are exhausted.
  def evaluate(buffer: Row) = {

    // I don't know; I'm just concatenating both attributes here :\


    //I want a Time Series of season_year and Sum(quantity) here so as to Calculate Seasonality as follows
    /*
     * Something Like this
     * 
     * season_year              sumQty
     * Spring-2012                  2
     * Winter-2012                  6
     * Summer-2012                  0
     * Autumn-2012                  3
     * Spring-2013                  1
     * Winter-2013                  0
     * Summer-2013                  3
     * Autumn-2013                  5
     * 
     * 
     * This will be a Time Series for 2 years, 4 season 
     * 
     * say TimeSeries ts
     * 
     *          Spring  Winter  Summer  Autumn
     * 2012        2       6       0       3
     * 2013        1       0       3       5
     * 
     * 
     * 
     *  val etsForecast = SeasonalExponentialSmoothing.train(ts, 4) // ts: the time series, 4 quarters / seasons per year
     *  
     *  The final value to be returned from the aggregate function is etsForecast.bestParams()._3 (gamma)
     *  
     * */








    buffer.getString(1) + "--"+ buffer.getInt(0)
  }

}

  def main (args: Array[String]) {

    import sqlContext.implicits._

    val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("Sess.csv")


    val seasonality = new Seasonality()


    // Calculate seasonality value for each group
    df.groupBy("line_pn", "stock_loc_num").agg(seasonality(df.col("quantity"), df.col("season_year")).as("seasonality")).show()


  }

}
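
If it helps, the same UDAF can also be registered by name and called from SQL (standard Spark 1.5+ API; "sales" below is just a placeholder table name):

sqlContext.udf.register("seasonality", new Seasonality())
df.registerTempTable("sales")
sqlContext.sql(
  """SELECT line_pn, stock_loc_num, seasonality(quantity, season_year) AS seasonal
    |FROM sales
    |GROUP BY line_pn, stock_loc_num""".stripMargin).show()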

SeasonalExponentialSmoothingModel source

import breeze.linalg.{argmax, DenseMatrix => BDM, DenseVector => BDV}
import scala.math.{pow, sqrt}

class SeasonalExponentialSmoothingModel (
    val number: Int, //The number of y values that have been evaluated.
    val l_array: Array[Double],
    val b_array: Array[Double],
    val s_array: Array[Double],
    val best_index: Int,
    val MSE_vector: Array[Double],
    val m      : Int //The seasonal period, for monthly data, m = 12, for quarterly data, m = 4.
  ) {

  def this(    //Constructor used at the very beginning, given the initial level l, initial trend b, initial seasonal array s, and the period m.
    l: Double,
    b: Double,
    s: Array[Double],
    m: Int
    ){
    this(0,Array.fill(1331)(l),Array.fill(1331)(b),Array.fill(1331)(s).flatten,0,Array.fill(1331)(0.0),m) // 1331 = 11^3, one entry per (alpha, beta, gama) grid point from gridGenerator(10, 3)
  }
  // c: grid granularity (candidate values 0, 1/c, ..., 1), n: number of parameters.
  // Enumerates all (c+1)^n combinations, one sequence per parameter.
  private def gridGenerator(c: Int, n: Int) = for( i <- 0 to n-1) yield List.tabulate(pow((c+1),n).toInt)(x => ((x%(pow((c+1),n-i).toInt))/(pow((c+1),n-i-1).toInt)).toDouble/c)


  val IndexedSeq(alpha,beta,gama) = gridGenerator(10,3)      //Grid search: candidate values for alpha, beta and gama.
  val brzAlpha = new BDV[Double](alpha.toArray)
  val brzBeta  = new BDV[Double](beta.toArray)
  val brzGama  = new BDV[Double](gama.toArray)

  private val numOfIndex = alpha.length

  val brzL = new BDV[Double](l_array)
  val brzB = new BDV[Double](b_array)
  val brzMSE = new BDV[Double](MSE_vector)
  val brzS = new BDM(m,numOfIndex,s_array)
  val MSE      = brzMSE(best_index)

  def bestParams() = (brzAlpha(best_index),brzBeta(best_index),brzGama(best_index))//Get the best alpha and beta values as a tuple.
  def predict(predictionLength: Int = 12) = List.tabulate(predictionLength)(x => brzB * x.toDouble + brzB + brzL + brzS(x%m,::).t)
  def bestPrediction(predictionLength: Int = 12) = predict(predictionLength).map(_(best_index))

  def evaluate(y: Double) = {
    val new_brzL = (brzAlpha :* (-brzS(0,::).t + y)) + ((-brzAlpha + 1.0):*(brzL + brzB))
    val new_brzS = new BDM[Double](m,numOfIndex)
    new_brzS(-1,::) := ((brzGama :* (-brzL - brzB + y)) + ((-brzGama + 1.0):*(brzS(0,::).t))).t
    new_brzS(0 to -2,::) := brzS(1 to -1,::) 
    val new_brzB = (brzBeta :* (new_brzL - brzL)) + ((-brzBeta + 1.0):*brzB)
    val y_predict_1 = predict(1)(0) //Step one forecast. Type: breeze.linalg.DenseVector[Double]
    val error_1 =  y_predict_1 - y //Step one error. Type: breeze.linalg.DenseVector[Double]
    val new_brzMSE = (((brzMSE :* brzMSE * number.toDouble) + (error_1 :* error_1)) * (1./(number+1))  ).map(sqrt(_)) //new MSE, Type: breeze.linalg.DenseVector[Double]
    val new_best_index = argmax(-new_brzMSE)
    new SeasonalExponentialSmoothingModel(number+1,new_brzL.toArray,new_brzB.toArray,new_brzS.data,new_best_index,new_brzMSE.toArray,m)
  }  
}
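
(For reference, evaluate() above appears to implement the standard additive Holt-Winters update equations, vectorised over every (alpha, beta, gama) grid point at once:

    l_t = alpha * (y_t - s_{t-m}) + (1 - alpha) * (l_{t-1} + b_{t-1})
    s_t = gama  * (y_t - l_{t-1} - b_{t-1}) + (1 - gama) * s_{t-m}
    b_t = beta  * (l_t - l_{t-1}) + (1 - beta) * b_{t-1}

with best_index tracking the grid point whose running error is smallest.)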



class SeasonalExponentialSmoothing {
  /**
   * Run the algorithm .
  */ 

  def initialize(y: List[Double], m: Int) = {
    val y_init = y.take(m)
    val l_0 = y_init.sum/y_init.length
    val b_0 = (y.take(2*m).drop(m).sum - y_init.sum)/m/m
    val s_0 = y_init.map(_ - l_0).toArray//(10.7,-9.5,-2.6,1.4)
    (l_0,b_0,s_0)
  }


  def run(y: List[Double],m:Int) = { 
    val number = y.length
    val (l,b,s) = initialize(y,m)
    val Model = new SeasonalExponentialSmoothingModel(l,b,s,m) //The initialization model...    
    y.foldLeft(Model)((b,a) => b.evaluate(a))
  }
}

object SeasonalExponentialSmoothing {

  def train(input: List[Double], m: Int): SeasonalExponentialSmoothingModel = { 
    new SeasonalExponentialSmoothing().run(input,m) //Input is the data to be forecasted and m is the period.
  }
}
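
The toTimeSeries helper used in the UDAF sketch further above could look roughly like this (again hypothetical: the season ordering, the 2009..2016 year range and the 0.5 gamma threshold are assumptions, not part of my existing code):

object SeasonalityHelper {

  // Assumed chronological order of the seasons within a year.
  private val seasons = Seq("Spring", "Summer", "Autumn", "Winter")

  // Turn the per-season sums collected by the UDAF into an ordered List[Double],
  // filling seasons with no sales with 0.0 and covering the 2009..2016 data range.
  def toTimeSeries(sums: Map[String, Int],
                   firstYear: Int = 2009,
                   lastYear: Int = 2016): List[Double] =
    (for {
      year   <- firstYear to lastYear
      season <- seasons
    } yield sums.getOrElse(s"$season-$year", 0).toDouble).toList

  // Assumed decision rule: report "A" (seasonal) when the best gamma is large, otherwise "N".
  def isSeasonal(sums: Map[String, Int]): String = {
    val model = SeasonalExponentialSmoothing.train(toTimeSeries(sums), 4)
    if (model.bestParams()._3 > 0.5) "A" else "N"
  }
}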

0 Answers:

No answers