I have a DataFrame containing the following data:
+------------+----------+
|BaseFromYear|BaseToYear|
+------------+----------+
|        2013|      2013|
+------------+----------+
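I need to compute the difference between the two years and then check whether a required year, which comes from another DataFrame, falls within the base years. So I wrote this query:
val df = DF_WE.filter($"id" === 3 && $"status" === 1)
  .select("BaseFromYear", "BaseToYear")
  .withColumn("diff_YY", $"BaseToYear" - $"BaseFromYear".cast(IntegerType))
  .withColumn("Baseyears", when($"diff_YY" === 0, $"BaseToYear"))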
+------------+----------+-------+---------+
|BaseFromYear|BaseToYear|diff_YY|Baseyears|
+------------+----------+-------+---------+
|        2013|      2013|      0|     2013|
+------------+----------+-------+---------+
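So I get the output above. But if BaseFromYear is 2014 and BaseToYear is 2017, the difference is 3, and I need [2014, 2015, 2016, 2017] as Baseyears. In the next step I have a required year, say 2016, that needs to be compared against the base years. Would the isin function work for this?
Answer 0 (score: 1)
I have added comments in the code; let me know if you need further clarification.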
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
// This is a user defined function(udf) which will populate an array of Int from BaseFromYear to BaseToYear
val generateRange: (Int, Int) => Array[Int] = (baseFromYear: Int, baseToYear: Int) => (baseFromYear to baseToYear).toArray
val sqlfunc = udf(generateRange) // Wrap the function as a Spark UDF so it can be used in Column expressions
val df = DF_WE.filter($"id" === 3 && $"status" === 1)
.select("BaseFromYear", "BaseToYear")
.withColumn("diff_YY", $"BaseToYear" - $"BaseFromYear".cast(IntegerType))
.withColumn("Baseyears", sqlfunc($"BaseFromYear", $"BaseToYear")) // using the UDF to populate new columns
df.show()
// Now let's say we select the records that have 2016 in Baseyears
val filteredDf = df.where(array_contains(df("Baseyears"), 2016))
filteredDf.show()
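// An array<int> column reaches a Scala UDF as Seq[Int], so declare that type rather than Seq[Row]
val isIn: (Int, Seq[Int]) => Boolean = (num: Int, years: Seq[Int]) => years.contains(num)
val sqlIsIn = udf(isIn)
// Assumes df also carries a required-year column named YY, which the select above does not include
val filteredDfBasedOnAnotherCol = df.filter(sqlIsIn(df("YY"), df("Baseyears")))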
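
As a possible alternative to the UDFs above, here is a minimal sketch that relies only on built-in functions. It assumes Spark 2.4 or later, where sequence is available. The local SparkSession, the sample rows, and the names sample, withYears, requiredYear, viaArray and viaBetween are made up for illustration and are not part of the original question. On the isin question: isin matches a column against a fixed list of values rather than against an array column, so array_contains (or a plain range check with between) is probably the better fit here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_contains, col, lit, sequence}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("base-years-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the filtered DF_WE data.
val sample = Seq((2013, 2013), (2014, 2017)).toDF("BaseFromYear", "BaseToYear")

// sequence(start, stop) builds an inclusive array column without a UDF (Spark 2.4+).
val withYears = sample.withColumn("Baseyears", sequence(col("BaseFromYear"), col("BaseToYear")))

// Keep the rows whose Baseyears array contains the required year.
val requiredYear = 2016
val viaArray = withYears.where(array_contains(col("Baseyears"), requiredYear))

// Equivalent range check that does not need the array at all.
val viaBetween = sample.where(lit(requiredYear).between(col("BaseFromYear"), col("BaseToYear")))

viaArray.show()
viaBetween.show()
```

If the required year lives in a column of another DataFrame, the between form can also serve as a join condition, which avoids materializing the array of years.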