I'm trying to create a generic function `diff` (like R's `diff` function) that, given a binary function, produces a vector of differences over a target column of a DataFrame, like this:
Given the data `df(col1, col2)`:

```scala
Seq(
  ("A", 1),
  ("A", 1),
  ("A", 5),
  ("B", 1),
  ("B", 3)).toDF
```

and a binary function `f`:

```scala
(x: Int, y: Int) => y - x
```

calling:

```scala
diff(df, "col1", "col2", f, 0)
```
should produce:

```scala
Seq(
  ("A", 1, 0),
  ("A", 1, 0),
  ("A", 5, 4),
  ("B", 1, 0),
  ("B", 3, 2)).toDF
```
The main difference from R's `diff` is that mine works in a group-by fashion.
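(An aside, not from the original question.) The per-group logic itself is plain Scala and can be sketched without Spark. Below is a minimal, hypothetical version (`seqDiff` is an illustrative name) that prepends the `zero` value and applies `f(previous, current)` over adjacent pairs, which is the order needed for `f = (x, y) => y - x` to yield the expected `[0, 0, 4]` for group A:

```scala
// Minimal sketch of the per-group difference (illustrative names, no Spark).
// Prepends `zero`, then applies f(previous, current) to each adjacent pair.
def seqDiff[A, B](xs: Seq[A], f: (A, A) => B, zero: B): Seq[B] =
  if (xs.length < 2) Seq(zero)
  else zero +: xs.sliding(2).map { case Seq(prev, cur) => f(prev, cur) }.toSeq

val f = (x: Int, y: Int) => y - x
println(seqDiff(Seq(1, 1, 5), f, 0)) // List(0, 0, 4)
println(seqDiff(Seq(1, 3), f, 0))    // List(0, 2)
```

Note that the question's code below calls `diffFunc(x2, x1)` (current first, previous second), which with this `f` would negate the sign relative to the expected output shown above.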
In any case, at compile time I get an error like this:

```
Error:(41, 22) No TypeTag available for Array[B]
val funcUdf = udf(func)
```

and things like `udf[Array[B], Array[A]](seqFuncApply)` are not a solution...
```scala
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import scala.reflect.ClassTag

val spark = SparkSession.builder().appName("sparksql").master("local").getOrCreate()
import spark.implicits._

def diff[A: ClassTag, B: ClassTag](df: DataFrame, key: String, target: String, diffFunc: (A, A) => B, zero: B) = {
  val seqFuncApply =
    (xs: Array[A]) => {
      if (xs.length < 2) Array(zero)
      else xs.tail.zipWithIndex.map { tu =>
        val x2 = tu._1
        val idx: Int = tu._2
        val x1 = xs.init(idx)
        diffFunc(x2, x1)
      }.+:(zero)
    }

  val funcUdf = udf(seqFuncApply)

  val resultDf: DataFrame =
    df.select(key, target)
      .rdd
      .map(row => (row.getAs[A](0), row.getAs[A](1)))
      .aggregateByKey(Array[A]())(_ :+ _, _ ++ _)
      .toDF(key, target)
      .withColumn("diff_" + target, funcUdf(col(target)))

  val cbind: (DataFrame, DataFrame) => DataFrame =
    (df, df2) => {
      val x =
        df.withColumn("primaryKeyForCbind", monotonically_increasing_id())
          .withColumn("orderKeyForCbind", monotonically_increasing_id()).as("df")
      val y =
        df2.withColumn("primaryKeyForCbind", monotonically_increasing_id()).as("df2")
      x.join(y, col("df.primaryKeyForCbind") === col("df2.primaryKeyForCbind"))
        .sort("orderKeyForCbind")
        .drop("primaryKeyForCbind", "orderKeyForCbind")
    }

  cbind(
    resultDf.select(col(key), explode(col(target))).as("target"),
    resultDf.select(explode(col("diff_" + target)).as("diff_" + target)))
}
```
Answer (score: 3):
You should define the generic types `A` and `B` of `diff` with `TypeTag` instead of `ClassTag`:

```scala
def diff[A: TypeTag, B: TypeTag]
```

because the `udf` method requires a `TypeTag` for generic types.

P.S.: this error should be thrown at compile time.
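(Editor's aside, not part of the original answer.) The reason `udf` needs `TypeTag` rather than `ClassTag`: a `ClassTag` only carries the erased runtime class, while a `TypeTag` preserves the full type, including type arguments, which Spark needs to derive the UDF's return schema. A minimal sketch of the difference, runnable without Spark (helper names are illustrative):

```scala
import scala.reflect.ClassTag
import scala.reflect.runtime.universe.{TypeTag, typeTag}

// ClassTag erases type arguments: List[Int] is just List at runtime.
def erased[T: ClassTag]: String = implicitly[ClassTag[T]].runtimeClass.getSimpleName

// TypeTag preserves the full type, which is what udf() needs for a schema.
def preserved[T: TypeTag]: String = typeTag[T].tpe.toString

println(erased[List[Int]])    // "List"      (type argument lost)
println(preserved[List[Int]]) // "List[Int]" (full type available)
```

Inside `diff[A: ClassTag, B: ClassTag]`, only `ClassTag` evidence is in scope, so the compiler cannot materialize the `TypeTag[Array[B]]` that `udf(seqFuncApply)` demands, hence the "No TypeTag available for Array[B]" error; switching the context bounds to `TypeTag` supplies it.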