No TypeTag available

Time: 2018-01-31 10:58:18

Tags: apache-spark user-defined-functions

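I am trying to write a generic function diff (like the diff function in R) that, given a binary function, produces a vector of differences for a target column of a DataFrame, like this: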

  • Given data: df(col1, col2)

    Seq(
      ("A", 1),
      ("A", 1),
      ("A", 5),
      ("B", 1),
      ("B", 3)).toDF

  • Given binary function: f

    (x:Int,y:Int)=> y - x

  • diff(df, "col1", "col2", f, 0)

  • Result

    Seq(
      ("A", 1, 0),
      ("A", 1, 0),
      ("A", 5, 4),
      ("B", 1, 0),
      ("B", 3, 2)).toDF

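The main difference from diff in R is that it works in a group-by fashion: differences are computed within each key of col1, and the first row of each group gets the zero value.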
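The intended semantics can be sketched on plain Scala collections, without Spark. The helper name groupedDiff is hypothetical, and passing (previous, current) to f is an assumption chosen so that the output matches the expected result listed above.

```scala
// Hypothetical helper: a plain-collections model of the intended behaviour, not Spark code.
def groupedDiff[K, A, B](rows: Seq[(K, A)], f: (A, A) => B, zero: B): Seq[(K, A, B)] = {
  // Remember the previous value seen for each key while walking the rows in order.
  val previous = scala.collection.mutable.Map.empty[K, A]
  rows.map { case (key, value) =>
    // The first row of a group has no predecessor, so it gets `zero`.
    val d = previous.get(key).map(prev => f(prev, value)).getOrElse(zero)
    previous(key) = value
    (key, value, d)
  }
}

val rows = Seq(("A", 1), ("A", 1), ("A", 5), ("B", 1), ("B", 3))
val result = groupedDiff[String, Int, Int](rows, (x, y) => y - x, 0)
// result: Seq(("A",1,0), ("A",1,0), ("A",5,4), ("B",1,0), ("B",3,2))
```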

In any case, compilation fails with an error like this:

  Error:(41, 22) No TypeTag available for Array[B]
      val funcUdf = udf(func)

Something like udf[Array[B], Array[A]](seqFuncApply) does not solve it...

  import org.apache.spark.sql._
  import org.apache.spark.sql.functions._
  import scala.reflect.ClassTag

  val spark = SparkSession.builder().appName("sparksql").master("local").getOrCreate()
  import spark.implicits._

  def diff[A: ClassTag, B: ClassTag](df: DataFrame, key: String, target: String, diffFunc: (A, A) => B, zero: B) = {

    val seqFuncApply =
      (xs: Array[A]) => {
        if (xs.length < 2) Array(zero)
        else xs.tail.zipWithIndex.map { tu =>
          val x2 = tu._1
          val idx: Int = tu._2
          val x1 = xs.init(idx)
          diffFunc(x2, x1)
        }.+:(zero)
      }

    val funcUdf = udf(seqFuncApply)

    val resultDf: DataFrame =
      df.select(key, target)
        .rdd
        .map(row => (row.getAs[A](0), row.getAs[A](1)))
        .aggregateByKey(Array[A]())(_ :+ _, _ ++ _)
        .toDF(key, target)
        .withColumn("diff_" + target, funcUdf(col(target)))

    val cbind: (DataFrame, DataFrame) => DataFrame =
      (df, df2) => {
        val x =
          df.withColumn("primaryKeyForCbind", monotonically_increasing_id())
            .withColumn("orderKeyForCbind", monotonically_increasing_id()).as("df")
        val y =
          df2.withColumn("primaryKeyForCbind", monotonically_increasing_id()).as("df2")
        x.join(y, col("df.primaryKeyForCbind") === col("df2.primaryKeyForCbind"))
          .sort("orderKeyForCbind")
          .drop("primaryKeyForCbind", "orderKeyForCbind")
      }

    cbind(
      resultDf.select(col(key), explode(col(target))).as("target"),
      resultDf.select(explode(col("diff_" + target)).as("diff_" + target)))
  }

1 answer:

Answer 0 (score: 3)

The generic types A and B of diff should be bound with TypeTag rather than ClassTag in the signature

 def diff[A: ClassTag, B: ClassTag]

because the udf method requires a TypeTag for generic types.

PS: This error should be thrown at compile time.
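A minimal sketch of that suggestion, covering only the UDF and not the whole diff function. The helper name diffUdf is hypothetical. It takes Seq rather than Array, on the assumption that Spark passes array columns to Scala UDFs as Seq values, and it calls diffFunc with (previous, current), which is also an assumption. If the body still constructs an Array[A] or Array[B], a ClassTag bound is needed in addition to the TypeTag.

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import scala.reflect.runtime.universe.TypeTag

// Hypothetical helper, not part of Spark: wraps a pairwise diff function in a UDF.
// The TypeTag bounds let the compiler supply TypeTag[Seq[A]] and TypeTag[Seq[B]],
// which udf needs in order to derive the Spark SQL data types.
def diffUdf[A: TypeTag, B: TypeTag](diffFunc: (A, A) => B, zero: B): UserDefinedFunction = {
  val seqFuncApply: Seq[A] => Seq[B] = xs =>
    // Pair each element with its successor; the first element gets `zero`.
    zero +: xs.zip(xs.drop(1)).map { case (prev, cur) => diffFunc(prev, cur) }
  udf(seqFuncApply)
}

// Usage with concrete types, e.g. on an array<int> column:
val intDiff = diffUdf[Int, Int]((x, y) => y - x, 0)
```

With concrete type arguments at the call site, the compiler can materialize the tags itself, so no extra implicit needs to be passed by hand.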