I'm using the custom aggregation feature of Spark SQL.
The use case is as follows.
I have data like this (sample):
+-----------+-------+-------------+--------+
|season_year|line_pn|stock_loc_num|quantity|
+-----------+-------+-------------+--------+
|Autumn-2012|ACD47PS|           22|       2|
|Autumn-2012|ACD47PS|            3|       1|
|Autumn-2012|ACD47PS|           52|       9|
|Autumn-2012|ACD47PS|            9|       1|
|Autumn-2012|ACD47PS|            1|       4|
|Autumn-2012|ACD47PS|            1|       1|
|Autumn-2012|ACD47PS|            1|       1|
|Autumn-2012|ACD47PS|           10|       2|
|Autumn-2012|ACD47PS|           12|       2|
|Autumn-2012|ACD47PS|           15|       2|
|Autumn-2012|ACD47PS|           15|       3|
|Autumn-2012|ACD47PS|           15|       3|
|Autumn-2012|ACD47PS|           16|       1|
|Autumn-2012|ACD47PS|           18|       1|
|Autumn-2012|ACD47PS|           18|       3|
|Autumn-2012|ACD47PS|            2|      49|
|Autumn-2012|ACD47PS|            2|       7|
|Autumn-2012|ACD47PS|           21|       5|
|Autumn-2012|ACD47PS|           22|       8|
|Autumn-2012|ACD47PS|           24|       3|
+-----------+-------+-------------+--------+
Note: there are ~250K line_pn values, 70 stock_loc_num values, and season_year runs from 2009 to Spring-2016.
I'm trying to write a Spark SQL UDAF that is applied per (line_pn, stock_loc_num) group and takes two columns, quantity and season_year. Inside the custom aggregate function (i.e. for each line_pn & stock_loc_num group) I want to group by season_year and compute Sum(quantity):
df.groupBy("line_pn", "stock_loc_num").agg(seasonality(df.col("quantity"), df.col("season_year")).as("seasonality"))
Then, from the aggregated data, I want to build a time series and compute an exponential smoothing state space model (I have already implemented it; it takes a TimeSeries as input) that tells me whether the line_pn at that stock_loc_num is seasonal or non-seasonal.
The output must be, for each (line_pn, stock_loc_num) group, the aggregated seasonality flag:
+-------+-------------+--------+
|line_pn|stock_loc_num|seasonal|
+-------+-------------+--------+
|ACD47PS|           22|       N|
|ACD47PS|            3|       A|
|MOTFP70|           52|       N|
+-------+-------------+--------+
I have tried many things but I cannot get the UDAF written. Please help.
The code:
Main code:
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext}
object UDAF {
  // Extend UserDefinedAggregateFunction to write a custom aggregate function.
  // You can also specify any constructor arguments.
  val conf = new SparkConf().setAppName("HiveQL").setMaster("local[4]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  class Seasonality extends UserDefinedAggregateFunction {
    // Input data type schema
    def inputSchema: StructType = StructType(Array(
      StructField("quantity", IntegerType),
      StructField("season", StringType)))

    // Intermediate schema
    def bufferSchema: StructType = StructType(
      StructField("sumQty", IntegerType) ::
      StructField("season_year", StringType) :: Nil
    )

    // Returned data type
    def dataType: DataType = StringType

    // Self-explaining
    def deterministic = true

    // This function is called whenever the key changes
    def initialize(buffer: MutableAggregationBuffer) = {
      buffer(0) = 0  // set the quantity sum to 0
      buffer(1) = "" // set season_year to blank
    }

    // Iterate over each entry of a group
    def update(buffer: MutableAggregationBuffer, input: Row) = {
      // Clueless about what should be done here. I'm just summing the quantity
      // attribute and pushing the new String into the buffer :\
      buffer(0) = buffer.getInt(0) + input.getInt(0)
      buffer(1) = input.getString(1)
    }

    // Merge two partial aggregates
    def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
      buffer1(0) = buffer1.getInt(0) + buffer2.getInt(0)
      buffer1(1) = buffer2.getString(1)
      //println("Buffer1 Seq: " + buffer1.toSeq)
      //println("Buffer2 Seq: " + buffer2.toSeq)
    }
    // Called after all the entries are exhausted.
    def evaluate(buffer: Row) = {
      // I don't know; I'm just concatenating both attributes here :\
      // I want a time series of season_year and Sum(quantity) here so as to
      // calculate seasonality as follows:
      /*
       * Something like this:
       *
       * season_year sumQty
       * Spring-2012      2
       * Winter-2012      6
       * Summer-2012      0
       * Autumn-2012      3
       * Spring-2013      1
       * Winter-2013      0
       * Summer-2013      3
       * Autumn-2013      5
       *
       * This will be a time series for 2 years, 4 seasons:
       *
       * say TimeSeries ts
       *
       *        Spring Winter Summer Autumn
       * 2012        2      6      0      3
       * 2013        1      0      3      5
       *
       * val etsForecast = SeasonalExponentialSmoothing.train(ts, 4) // ts: time series, 4 quarters / seasons
       *
       * The final value to be returned from the aggregate function is
       * etsForecast.bestParams()._3
       */
      buffer.getString(1) + "--" + buffer.getInt(0)
    }
  }
  def main(args: Array[String]) {
    import sqlContext.implicits._
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")      // use the first line of all files as the header
      .option("inferSchema", "true") // automatically infer data types
      .load("Sess.csv")
    val seasonality = new Seasonality()
    // Calculate the seasonality value for each group
    df.groupBy("line_pn", "stock_loc_num")
      .agg(seasonality(df.col("quantity"), df.col("season_year")).as("seasonality"))
      .show()
  }
}
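What I think I need (but could not get working) is a buffer that keeps one running sum per season_year, so that evaluate() has the whole series to hand to the model. A rough sketch of that direction, using a MapType buffer cell and the same imports as the main code above; the class name SeasonalityMap, the season ordering, the gama threshold of 0.5 and the A/N mapping are placeholders I made up:

// Sketch only: one buffer cell holding season_year -> running Sum(quantity).
class SeasonalityMap extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(Array(
    StructField("quantity", IntegerType),
    StructField("season", StringType)))

  def bufferSchema: StructType = StructType(Array(
    StructField("sums", MapType(StringType, IntegerType))))

  def dataType: DataType = StringType
  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer(0) = Map.empty[String, Int]
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    val sums = buffer.getAs[Map[String, Int]](0)
    val season = input.getString(1)
    buffer(0) = sums + (season -> (sums.getOrElse(season, 0) + input.getInt(0)))
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    val m1 = buffer1.getAs[Map[String, Int]](0)
    val m2 = buffer2.getAs[Map[String, Int]](0)
    buffer1(0) = m2.foldLeft(m1) { case (acc, (k, v)) =>
      acc + (k -> (acc.getOrElse(k, 0) + v))
    }
  }

  def evaluate(buffer: Row) = {
    val sums = buffer.getAs[Map[String, Int]](0)
    if (sums.isEmpty) "N"
    else {
      // Order seasons chronologically and fill missing ones with 0.
      val seasons = Array("Spring", "Summer", "Autumn", "Winter") // assumed order
      val years = sums.keys.map(_.split("-")(1).toInt)
      val ts = (years.min to years.max).toList.flatMap { y =>
        seasons.map(s => sums.getOrElse(s + "-" + y, 0).toDouble)
      }
      if (ts.length < 8) "N" // the model's initialize() needs at least 2 * m points
      else {
        val model = SeasonalExponentialSmoothing.train(ts, 4)
        if (model.bestParams()._3 > 0.5) "A" else "N" // threshold is a guess
      }
    }
  }
}

I'm not sure this is the right way to carry the per-season sums through the shuffle, which is exactly what I'd like feedback on.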
SeasonalExponentialSmoothingModel source:
import breeze.linalg.{argmax, DenseMatrix => BDM, DenseVector => BDV}
import scala.math.{pow, sqrt}

class SeasonalExponentialSmoothingModel(
    val number: Int,              // the number of y values evaluated so far
    val l_array: Array[Double],
    val b_array: Array[Double],
    val s_array: Array[Double],
    val best_index: Int,
    val MSE_vector: Array[Double],
    val m: Int                    // the seasonal period: m = 12 for monthly data, m = 4 for quarterly data
) {
  // Constructor for the very beginning, with only the initial l, b and s given.
  // Note: 1331 = 11^3, the size of the (alpha, beta, gama) grid built below.
  def this(
      l: Double,
      b: Double,
      s: Array[Double],
      m: Int
  ) {
    this(0, Array.fill(1331)(l), Array.fill(1331)(b), Array.fill(1331)(s).flatten, 0, Array.fill(1331)(0.0), m)
  }
  // c: grid granularity (c + 1 values per parameter with step 1/c), n: number of parameters
  private def gridGenerator(c: Int, n: Int) =
    for (i <- 0 to n - 1)
      yield List.tabulate(pow(c + 1, n).toInt)(x =>
        ((x % pow(c + 1, n - i).toInt) / pow(c + 1, n - i - 1).toInt).toDouble / c)
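  // For intuition: gridGenerator(2, 2) yields two sequences of length (2+1)^2 = 9
  // that together enumerate the Cartesian product of {0, 0.5, 1} with itself:
  //   first  = 0, 0, 0, 0.5, 0.5, 0.5, 1, 1, 1
  //   second = 0, 0.5, 1, 0, 0.5, 1, 0, 0.5, 1
  // gridGenerator(10, 3) below therefore yields 11^3 = 1331 candidate triples,
  // which is where the magic number 1331 in the auxiliary constructor comes from.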
  val IndexedSeq(alpha, beta, gama) = gridGenerator(10, 3) // grid-search candidates for alpha, beta and gama
  val brzAlpha = new BDV[Double](alpha.toArray)
  val brzBeta = new BDV[Double](beta.toArray)
  val brzGama = new BDV[Double](gama.toArray)
  private val numOfIndex = alpha.length
  val brzL = new BDV[Double](l_array)
  val brzB = new BDV[Double](b_array)
  val brzMSE = new BDV[Double](MSE_vector)
  val brzS = new BDM(m, numOfIndex, s_array)
  val MSE = brzMSE(best_index)
  // Get the best (alpha, beta, gama) values as a tuple.
  def bestParams() = (brzAlpha(best_index), brzBeta(best_index), brzGama(best_index))
  def predict(predictionLength: Int = 12) =
    List.tabulate(predictionLength)(x => brzB * x.toDouble + brzB + brzL + brzS(x % m, ::).t)
  def bestPrediction(predictionLength: Int = 12) = predict(predictionLength).map(_(best_index))
  def evaluate(y: Double) = {
    val new_brzL = (brzAlpha :* (-brzS(0, ::).t + y)) + ((-brzAlpha + 1.0) :* (brzL + brzB))
    val new_brzS = new BDM[Double](m, numOfIndex)
    new_brzS(-1, ::) := ((brzGama :* (-brzL - brzB + y)) + ((-brzGama + 1.0) :* brzS(0, ::).t)).t
    new_brzS(0 to -2, ::) := brzS(1 to -1, ::)
    val new_brzB = (brzBeta :* (new_brzL - brzL)) + ((-brzBeta + 1.0) :* brzB)
    val y_predict_1 = predict(1)(0)  // step-one forecast; type: breeze.linalg.DenseVector[Double]
    val error_1 = y_predict_1 - y    // step-one error; type: breeze.linalg.DenseVector[Double]
    // Updated running root-mean-squared error (stored under the MSE name); type: breeze.linalg.DenseVector[Double]
    val new_brzMSE = (((brzMSE :* brzMSE * number.toDouble) + (error_1 :* error_1)) * (1.0 / (number + 1))).map(sqrt(_))
    val new_best_index = argmax(-new_brzMSE)
    new SeasonalExponentialSmoothingModel(number + 1, new_brzL.toArray, new_brzB.toArray, new_brzS.data, new_best_index, new_brzMSE.toArray, m)
  }
}
class SeasonalExponentialSmoothing {
  /**
   * Run the algorithm.
   */
  def initialize(y: List[Double], m: Int) = {
    val y_init = y.take(m)
    val l_0 = y_init.sum / y_init.length
    val b_0 = (y.take(2 * m).drop(m).sum - y_init.sum) / m / m
    val s_0 = y_init.map(_ - l_0).toArray // e.g. (10.7, -9.5, -2.6, 1.4)
    (l_0, b_0, s_0)
  }

  def run(y: List[Double], m: Int) = {
    val number = y.length
    val (l, b, s) = initialize(y, m)
    val model = new SeasonalExponentialSmoothingModel(l, b, s, m) // the initialization model
    y.foldLeft(model)((acc, yi) => acc.evaluate(yi))
  }
}
object SeasonalExponentialSmoothing {
  // input: the data to be forecast; m: the seasonal period.
  def train(input: List[Double], m: Int): SeasonalExponentialSmoothingModel =
    new SeasonalExponentialSmoothing().run(input, m)
}
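For reference, this is how I drive the model on its own (toy quarterly data, m = 4; the numbers are invented for illustration):

// Two years of quarterly sums (m = 4); values invented for illustration.
val ts = List(2.0, 6.0, 0.0, 3.0, 1.0, 0.0, 3.0, 5.0)
val model = SeasonalExponentialSmoothing.train(ts, 4)
val (alpha, beta, gama) = model.bestParams()
println(s"alpha=$alpha, beta=$beta, gama=$gama, error=${model.MSE}")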