Comparing previous data with current data in Spark Scala

Date: 2018-09-11 06:46:31

Tags: scala apache-spark-sql

I want to compare the previous data with the current data month by month. I have data like the following.

Data set 1 (previous):                   Data set 2 (latest):

   Year-month  Sum-count                 Year-month    Sum-count
      --          --                       201808          48
     201807       30                       201807          22
     201806       20                       201806          20
     201805       35                       201805          20
     201804       12                       201804           9
     201803       15                       --              --

Given the data sets above, I want to compare the two on the Year-month and Sum-count columns and work out the percentage difference between them.

I am using Spark 2.3.0 and Scala 2.11.

Here is the code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, sum}

val mdf = spark.read.format("csv").
          option("inferSchema", "true").
          option("header", "true").
          option("delimiter", ",").
          option("charset", "utf-8").
          load("c:\\test.csv")
mdf.createOrReplaceTempView("test")
// Hyphenated column names must be escaped with backticks in Spark SQL
val res = spark.sql("select `Year-month`, SUM(`Sum-count`) as SUM_AMT from test d group by `Year-month`")
val win = Window.orderBy("Year-month")

val res1 = res.withColumn("Prev_month", lag("SUM_AMT", 1, 0).over(win))
              .withColumn("percentage", col("Prev_month") / sum("SUM_AMT").over())
res1.show()
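As an aside, lag over a window compares consecutive rows within a single DataFrame, while the goal here is to match the same month across two snapshots, which is more directly a join on the month key. A minimal sketch, assuming the two aggregated frames are named prevDf and currDf (hypothetical names, not from the question):

import org.apache.spark.sql.functions.{abs, col}

// prevDf and currDf are each assumed to hold `Year-month` and SUM_AMT
val compared = prevDf.alias("p")
  .join(currDf.alias("c"), "Year-month")
  .withColumn("pct_diff",
    abs(col("p.SUM_AMT") - col("c.SUM_AMT")) / col("p.SUM_AMT") * 100)

compared.show()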

I need output like this:

If the percentage difference is greater than 10%, the flag needs to be set to F.

set1     cnt             set2    cnt     output(Percentage)  Flag
201807   30             201807   22         7%                T
201806   20             201806   20         0%                T
201805   35             201805   20         57%               F

Please help me.

2 Answers:

Answer 0 (score: 0)

It can be done the following way:

import spark.implicits._
import org.apache.spark.sql.functions.{abs, when}

val data1 = List(
  ("201807", 30),
  ("201806", 20),
  ("201805", 35),
  ("201804", 12),
  ("201803", 15)
)
val data2 = List(
  ("201808", 48),
  ("201807", 22),
  ("201806", 20),
  ("201805", 20),
  ("201804", 9)
)
val df1 = data1.toDF("Year-month", "Sum-count")
val df2 = data2.toDF("Year-month", "Sum-count")

// Inner join on the month key keeps only the months present in both sets
val joined = df1.alias("df1").join(df2.alias("df2"), "Year-month")
joined
  .withColumn("output(Percentage)", abs($"df1.Sum-count" - $"df2.Sum-count").divide($"df1.Sum-count"))
  .withColumn("Flag", when($"output(Percentage)" > 0.1, "F").otherwise("T"))
  .show(false)

Output:

+----------+---------+---------+-------------------+----+
|Year-month|Sum-count|Sum-count|output(Percentage) |Flag|
+----------+---------+---------+-------------------+----+
|201807    |30       |22       |0.26666666666666666|F   |
|201806    |20       |20       |0.0                |T   |
|201805    |35       |20       |0.42857142857142855|F   |
|201804    |12       |9        |0.25               |F   |
+----------+---------+---------+-------------------+----+
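If the question's exact rendering ("7%"-style strings) is wanted, the fraction can be scaled and formatted before flagging. A small variation on the snippet above, reusing its joined frame (a sketch, not part of the original answer):

import org.apache.spark.sql.functions.{abs, format_string, when}

val withPct = joined
  .withColumn("diff", abs($"df1.Sum-count" - $"df2.Sum-count").divide($"df1.Sum-count"))
  .withColumn("output(Percentage)", format_string("%.0f%%", $"diff" * 100))  // e.g. 0.2667 -> "27%"
  .withColumn("Flag", when($"diff" > 0.1, "F").otherwise("T"))
  .drop("diff")
withPct.show(false)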

Answer 1 (score: 0)

Here is my solution:

import spark.implicits._
import org.apache.spark.sql.functions._

val values1 = List(
  ("1201807", "30"),
  ("1201806", "20"),
  ("1201805", "35"),
  ("1201804", "12"),
  ("1201803", "15")
)
val values2 = List(
  ("201808", "48"),
  ("1201807", "22"),
  ("1201806", "20"),
  ("1201805", "20"),
  ("1201804", "9")
)
val df1 = values1.toDF
val df2 = values2.toDF

// Full outer join on the month key, then compare each month's share of its
// data set's total; rows with no match on either side are dropped at the end
df1.join(df2, Seq("_1"), "full").toDF("set", "cnt1", "cnt2")
  .withColumn("percentage1", col("cnt1") / sum("cnt1").over() * 100)
  .withColumn("percentage2", col("cnt2") / sum("cnt2").over() * 100)
  .withColumn("percentage", abs(col("percentage2") - col("percentage1")))
  .withColumn("flag", when(col("percentage") > 10, "F").otherwise("T"))
  .na.drop()
  .show()

The result looks like this:

+-------+----+----+------------------+------------------+------------------+----+
|    set|cnt1|cnt2|       percentage1|       percentage2|        percentage|flag|
+-------+----+----+------------------+------------------+------------------+----+
|1201804|  12|   9|10.714285714285714| 7.563025210084033|  3.15126050420168|   T|
|1201807|  30|  22|26.785714285714285|18.487394957983195|  8.29831932773109|   T|
|1201806|  20|  20|17.857142857142858| 16.80672268907563|1.0504201680672267|   T|
|1201805|  35|  20|             31.25| 16.80672268907563|14.443277310924369|   F|
+-------+----+----+------------------+------------------+------------------+----+
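Note that, unlike the other answer, this measures each month's share of its own data set's total rather than the direct month-on-month relative change, so the flags can differ for the same input: 201807 changes by more than 10% in relative terms (30 vs 22), but its share of the total only moves by about 8.3 points, hence T here.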

I hope this helps :)