Comparing years as an array in Spark

Time: 2017-03-10 06:34:45

Tags: scala apache-spark

I have a dataframe containing
+------------+----------+
|BaseFromYear|BaseToYear|
+------------+----------+
|        2013|      2013|
+------------+----------+

I need to compute the difference between the two years and then, against another dataframe, check whether a required year falls within the base years, so I wrote this query:

val df = DF_WE.filter($"id" === 3 && $"status" === 1)
  .select("BaseFromYear", "BaseToYear")
  .withColumn("diff_YY", $"BaseToYear" - $"BaseFromYear".cast(IntegerType))
  .withColumn("Baseyears", when($"diff_YY" === 0, $"BaseToYear"))
 +------------+----------+-------+---------+
 |BaseFromYear|BaseToYear|diff_YY|Baseyears|
 +------------+----------+-------+---------+
 |        2013|      2013|      0|     2013|
 +------------+----------+-------+---------+

So I get the output above. But if BaseFromYear is 2014 and BaseToYear is 2017, the difference will be 3, and I need [2014, 2015, 2016, 2017] as Baseyears. Then, in the next step, I have a required year, say 2016, that needs to be compared against the base years. Would the isin function work here?
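The expansion and membership check being asked for can be sketched in plain Scala, independent of Spark. This is a minimal sketch; the object and method names (`YearRange`, `baseYears`, `containsYear`) are illustrative, not part of any Spark API:

```scala
object YearRange {
  // Expand an inclusive year range into every year it covers,
  // e.g. (2014, 2017) -> Array(2014, 2015, 2016, 2017)
  def baseYears(from: Int, to: Int): Array[Int] = (from to to).toArray

  // Check whether a required year falls inside the expanded range
  def containsYear(from: Int, to: Int, year: Int): Boolean =
    baseYears(from, to).contains(year)
}
```

Usage: `YearRange.baseYears(2014, 2017)` yields `Array(2014, 2015, 2016, 2017)`, and `YearRange.containsYear(2014, 2017, 2016)` yields `true`. The answer below lifts exactly this logic into a Spark UDF.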

1 Answer:

Answer 0: (score: 1)

I have added comments in the code; please let me know if you need further clarification.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// A user-defined function (UDF) that builds an array of Ints from BaseFromYear to BaseToYear, inclusive
val generateRange: (Int, Int) => Array[Int] = (baseFromYear: Int, baseToYear: Int) => (baseFromYear to baseToYear).toArray
val sqlfunc = udf(generateRange) // Wrapping the function as a Spark UDF

val df = DF_WE.filter($"id" === 3 && $"status" === 1)
  .select("BaseFromYear", "BaseToYear")
  .withColumn("diff_YY", $"BaseToYear" - $"BaseFromYear".cast(IntegerType))
  .withColumn("Baseyears", sqlfunc($"BaseFromYear", $"BaseToYear")) // using the UDF to populate the new column

df.show()
// Now let's say we select the records whose Baseyears array contains 2016
val filteredDf = df.where(array_contains(df("Baseyears"), 2016))
filteredDf.show()

// An ArrayType(IntegerType) column arrives in a UDF as Seq[Int] (a WrappedArray), not Seq[Row]
val isIn: (Int, Seq[Int]) => Boolean = (num: Int, years: Seq[Int]) => years.contains(num)
val sqlIsIn = udf(isIn)

// Assumes the dataframe also carries the required-year column "YY";
// add it to the select above if it comes from DF_WE
val filteredDfBasedOnAnotherCol = df.filter(sqlIsIn(df("YY"), df("Baseyears")))
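The per-row logic that `isIn` applies can be verified without a Spark session. This sketch mirrors the UDF body on plain collections; the object name `IsInCheck` is illustrative only:

```scala
object IsInCheck {
  // Same body as the isIn UDF above: membership of a year in the expanded sequence
  val isIn: (Int, Seq[Int]) => Boolean = (num, years) => years.contains(num)

  // What generateRange would produce for a row with BaseFromYear=2014, BaseToYear=2017
  val baseyears: Seq[Int] = (2014 to 2017).toSeq
}
```

Here `IsInCheck.isIn(2016, IsInCheck.baseyears)` is `true` and `IsInCheck.isIn(2013, IsInCheck.baseyears)` is `false`. Note that when the year to test is a fixed literal rather than another column, `array_contains` (as in the `filteredDf` example above) already covers this without a second UDF.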