Transforming a Spark DataFrame in Spark Scala

Date: 2018-04-16 06:28:07

Tags: scala apache-spark spark-dataframe partitioning

Input DataFrame:

+----+---------+------+
|name|date     |amount|
+----+---------+------+
|Jhon|4/6/2018 |   100|
|Jhon|4/6/2018 |   200|
|Jhon|4/13/2018|   300|
|Jhon|4/20/2018|   500|
|Lee |5/4/2018 |   100|
|Lee |4/4/2018 |   200|
|Lee |5/4/2018 |   300|
|Lee |4/11/2018|   700|
+----+---------+------+

Expected DataFrame:

+----+---------+------+
|name|date     |amount|
+----+---------+------+
|Jhon|4/6/2018 |   100|
|Jhon|4/6/2018 |   200|
|Jhon|4/13/2018|   100|
|Jhon|4/13/2018|   200|
|Jhon|4/13/2018|   300|
|Jhon|4/20/2018|   100|
|Jhon|4/20/2018|   200|
|Jhon|4/20/2018|   300|
|Jhon|4/20/2018|   500|
|Lee |5/4/2018 |   100|
|Lee |5/4/2018 |   200|
|Lee |5/4/2018 |   300|
|Lee |5/11/2018|   100|
|Lee |4/11/2018|   200|
|Lee |5/11/2018|   300|
|Lee |4/11/2018|   700|
+----+---------+------+

So here 300 is the new value for 04/13/2018, and the 100 and 200 from 04/06/2018 should also appear on 04/13/2018; the same should happen for the next Friday dates of the other names. Is there a way to do this in Spark Scala? Any help would be appreciated.

My code only works for the name 'Jhon' and only for the two Friday dates '4/6/2018' and '4/13/2018':

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lag

def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("Excel-read-write").setMaster("local")
  val sc = new SparkContext(conf)
  val sqlc = new org.apache.spark.sql.SQLContext(sc)
  val ss = SparkSession.builder().master("local").appName("Excel-read-write").getOrCreate()
  import ss.sqlContext.implicits._

  // read the old records from CSV
  var df1 = sqlc.read.format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("oldRecords.csv")
  df1.show(false)
  println("---- df1 row count ---- " + df1.count())

  if (df1.count() > 0) {
    for (i <- 0 until df1.count().toInt - 1) {
      // duplicate the rows, then shift amount and date one row forward
      var df2 = df1.unionAll(df1)
      var w1 = org.apache.spark.sql.expressions.Window.orderBy("date")
      var df3 = df2.withColumn("previousAmount", lag("amount", 1).over(w1))
                   .withColumn("newdate", lag("date", 1).over(w1))
      // keep only the rows that picked up a shifted date
      var df4 = df3.filter(df3.col("newdate").isNotNull)
      var df5 = df4.select("name", "amount", "newdate").distinct()
      df5.show(false)
      df1 = df5.withColumnRenamed("newdate", "date")
    }
  }
}

2 Answers:

Answer 0 (score: 1)

From your question, it sounds like you want to add every week up to the maximum date for each name. Here is what you can do.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.joda.time.LocalDate
// input data 
val dataDF  = Seq(
  ("Jhon", "4/6/2018", 100),
  ("Jhon", "4/6/2018", 200),
  ("Jhon", "4/13/2018", 300),
  ("Jhon", "4/20/2018", 500),
  ("Lee", "5/4/2018", 100),
  ("Lee", "4/4/2018", 200),
  ("Lee", "5/4/2018", 300),
  ("Lee", "4/11/2018", 700)
).toDF("name", "date", "amount")
  .withColumn("date", to_date($"date", "MM/dd/yyyy"))

val window = Window.partitionBy($"name")

//find the maximum date of each name
val df = dataDF.withColumn("maxDate", max($"date").over(window))
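
As a quick sanity check for the sample data above, every row should now carry the per-name maximum date (2018-04-20 for Jhon and 2018-05-04 for Lee):

// For the sample input, each Jhon row gets maxDate = 2018-04-20
// and each Lee row gets maxDate = 2018-05-04.
df.select("name", "date", "maxDate").show(false)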

Create a UDF that finds all the weeks between two dates:

val calculateDate = udf((min: String, max: String) => {
  // to collect all the dates
  val totalDates = scala.collection.mutable.MutableList[LocalDate]()
  var start = LocalDate.parse(min)
  val end = LocalDate.parse(max)
  while ( {
    !start.isAfter(end)
  }) {
    totalDates += start
    start = start.plusWeeks(1)
  }
  totalDates.map(_.toString("MM/dd/yyyy"))
})
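
If you would rather avoid the joda-time dependency, here is a minimal alternative sketch of the same UDF using java.time; it assumes the same ISO-formatted date strings that the DateType columns are passed in as:

val calculateDateJava = udf((min: String, max: String) => {
  // fully qualified java.time names to avoid clashing with the joda import above,
  // and everything built inside the UDF so nothing non-serializable is captured
  val outFormat = java.time.format.DateTimeFormatter.ofPattern("MM/dd/yyyy")
  val end = java.time.LocalDate.parse(max)
  // the start date plus every following week, until the end date is passed
  Iterator.iterate(java.time.LocalDate.parse(min))(_.plusWeeks(1))
    .takeWhile(!_.isAfter(end))
    .map(_.format(outFormat))
    .toSeq
})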

Now apply the UDF and explode the array it returns:
val finalDf = df.withColumn("date", explode(calculateDate($"date", $"maxDate")))
                .drop("maxDate")

Output:

+----+----------+------+
|name|date      |amount|
+----+----------+------+
|Jhon|04/06/2018|100   |
|Jhon|04/13/2018|100   |
|Jhon|04/20/2018|100   |
|Jhon|04/06/2018|200   |
|Jhon|04/13/2018|200   |
|Jhon|04/20/2018|200   |
|Jhon|04/13/2018|300   |
|Jhon|04/20/2018|300   |
|Jhon|04/20/2018|500   |
|Lee |05/04/2018|100   |
|Lee |04/04/2018|200   |
|Lee |04/11/2018|200   |
|Lee |04/18/2018|200   |
|Lee |04/25/2018|200   |
|Lee |05/02/2018|200   |
|Lee |05/04/2018|300   |
|Lee |04/11/2018|700   |
|Lee |04/18/2018|700   |
|Lee |04/25/2018|700   |
|Lee |05/02/2018|700   |
+----+----------+------+
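
Note that explode does not guarantee any particular row order, so the output above is simply the order in which the rows were produced. If you want the result grouped by name and date, you can sort explicitly (a usage note only, assuming the column names used above):

// re-parse the MM/dd/yyyy strings so the sort is chronological, not lexicographic
finalDf.orderBy($"name", to_date($"date", "MM/dd/yyyy")).show(false)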

I hope this helps!

Answer 1 (score: 0)
