Why does using sortBy raise an error when sorting an RDD?

Time: 2018-12-13 11:18:54

Tags: scala apache-spark


Here is my function:

/**
    * 
    *
    * @param spark               spark
    * @param templateInfo        (code,(type_ids,content,lang))
    * @param pushedTemplatedInfo (CODE,PUSH_DATE,PUSHED_CNT)
    * @param templateCycle       
    * @param catTypeId           
    * @param templateCount       
    */
  def getNormalTemplate(spark: SparkSession, templateInfo: RDD[(String, (String, String, String))],
                        pushedTemplatedInfo: RDD[(String, (String, Int))],
                        templateCycle: Int, catTypeId: Int, templateCount: Int) = {
    val templateDate = pushUtil.getNextSomeday(templateCycle)
    println("templateDate:" + templateDate)
    val deleteTemplatedInfo = pushedTemplatedInfo.filter(_._2._1 >= templateDate).map(x => (x._1, x._2._1))
    val brpushedTemplatedMap = spark.sparkContext
      .broadcast(pushedTemplatedInfo.map(x => (x._1, x._2._2)).distinct().collectAsMap())
    val TemplateCodeSelection = templateInfo.filter(x => x._2._1 == catTypeId) 
      .map(x => (x._1, brpushedTemplatedMap.value.getOrElse(x._1, 0))) 
      .reduceByKey((x, y) => math.max(x, y))
      .subtractByKey(deleteTemplatedInfo) 
      .sortBy(x => (x._2, x._1))(Ordering.Tuple2(Ordering.Int,Ordering.String.reverse))

    //(code,(type_ids,content,lang))
    val res = templateInfo.map(x => x._1)
  }

Can anyone tell me why? I wrote this code following "How to sort a list in Scala by two fields?".

1 answer:

Answer 0 (score: 2)

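If you look at the signature of the sortBy method, you will see that it takes two implicit parameters, an Ordering and a ClassTag:

  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }

Because you pass the Ordering explicitly, you also need to pass the ClassTag of the tuple explicitly: an implicit parameter list has to be supplied in full or not at all.

You can create a class tag like this:

  ClassTag[(Int, String)](classOf[(Int, String)])

To solve your case, call sortBy with both the Ordering and this class tag in the second argument list.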
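
As a rough illustration of the point about the implicit parameter list, here is a minimal sketch that runs without Spark. The helper `sortByLike` and the sample `rows` are hypothetical and only mimic the shape of `RDD.sortBy`'s `(implicit ord: Ordering[K], ctag: ClassTag[K])` list; they are not part of Spark or of the question's code.

```scala
import scala.reflect.{ClassTag, classTag}

object SortByImplicits {
  // Hypothetical stand-in for RDD.sortBy: it has the same two implicit
  // parameters, but works on a plain Seq so the sketch needs no Spark.
  // The ClassTag is unused here; it only reproduces the shape of the signature.
  def sortByLike[T, K](data: Seq[T])(f: T => K)(
      implicit ord: Ordering[K], ctag: ClassTag[K]): Seq[T] =
    data.sortBy(f)(ord)

  // Ascending on the Int, descending on the String, as in the question.
  val ord: Ordering[(Int, String)] =
    Ordering.Tuple2(Ordering.Int, Ordering.String.reverse)

  // Made-up sample data in the (code, count) shape used by the question.
  val rows: Seq[(String, Int)] = Seq(("a", 2), ("b", 1), ("c", 1))

  // Option 1: supply the whole implicit list explicitly (Ordering and ClassTag).
  val explicitBoth: Seq[(String, Int)] =
    sortByLike(rows)(x => (x._2, x._1))(ord, classTag[(Int, String)])

  // Option 2: put the Ordering in implicit scope and omit the list entirely,
  // so the compiler fills in the ClassTag itself.
  val viaImplicit: Seq[(String, Int)] = {
    implicit val tupleOrd: Ordering[(Int, String)] = ord
    sortByLike(rows)(x => (x._2, x._1))
  }
}
```

Applied to the question's code, option 1 would presumably look like the line below (not tested against the asker's job):

  .sortBy(x => (x._2, x._1))(Ordering.Tuple2(Ordering.Int, Ordering.String.reverse), classTag[(Int, String)])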