Change data capture in Spark

Date: 2018-04-04 19:22:24

Tags: scala apache-spark

I have a change-data-capture requirement, but I'm confused about how to implement it. I have two dataframes. The first time, I receive the following file1:

prodid,lastupdatedate,indicator

00001,,A
00002,01-25-1981,A
00003,01-26-1982,A
00004,12-20-1985,A

The output should be:

00001,1900-01-01,2400-01-01,A
00002,1981-01-25,2400-01-01,A
00003,1982-01-26,2400-01-01,A
00004,1985-12-20,2400-01-01,A

The second time, I receive another file, file2:

prodid,lastupdatedate,indicator

00002,01-25-2018,U
00004,01-25-2018,U
00006,01-25-2018,A
00008,01-25-2018,A

I want the final result to look like:

00001,1900-01-01,2400-01-01,A
00002,1981-01-25,2018-01-25,I
00002,2018-01-25,2400-01-01,A
00003,1982-01-26,2400-01-01,A
00004,1985-12-20,2018-01-25,I
00004,2018-01-25,2400-01-01,A
00006,2018-01-25,2400-01-01,A
00008,2018-01-25,2400-01-01,A

So whenever an update arrives in the second file, its date should appear in the second column, the default date (2400-01-01) should appear in the third column, together with the relevant indicator. The default indicator is A.

This is how I started:

val spark = SparkSession.builder()
  .master("local")
  .appName("creating data frame for csv")
  .getOrCreate()

import spark.implicits._

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("d:/prod.txt")

val df1 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("d:/prod1.txt")

// fill missing lastupdatedate with a default
val newdf = df.na.fill("01-01-1900", Seq("lastupdatedate"))

// this is where I got stuck: == compares the Column objects themselves
// (always false here), not row values, so this if never expresses the
// per-row check I want
if ((df1("indicator") == 'U') && (df1("prodid") == newdf("prodid"))) {
  val df3 = df1.except(newdf)
}
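(For reference, the two inputs can also be built in memory instead of reading the d:/ files. This is a minimal sketch that replaces the two spark.read calls above; it reuses the spark session and the import spark.implicits._ from the snippet, and keeps prodid as a string so the leading zeros survive:)

// in-memory stand-in for spark.read ... .csv("d:/prod.txt")
val df = Seq(
  ("00001", null.asInstanceOf[String], "A"),
  ("00002", "01-25-1981", "A"),
  ("00003", "01-26-1982", "A"),
  ("00004", "12-20-1985", "A")
).toDF("prodid", "lastupdatedate", "indicator")

// in-memory stand-in for spark.read ... .csv("d:/prod1.txt")
val df1 = Seq(
  ("00002", "01-25-2018", "U"),
  ("00004", "01-25-2018", "U"),
  ("00006", "01-25-2018", "A"),
  ("00008", "01-25-2018", "A")
).toDF("prodid", "lastupdatedate", "indicator")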

1 Answer:

Answer 0 (score: 1):

You should join the two dataframes on prodid and use a few when functions to shape them into the expected output. Then filter out the updated records, turn each of them into the new active row, and union them back in (I have included comments explaining each part of the code):

import org.apache.spark.sql.functions._

// fill the empty lastupdatedate and convert the dates to the expected format
val newdf = df.na.fill("01-01-1900", Seq("lastupdatedate"))
  .withColumn("lastupdatedate",
    date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))

// convert the dates of the second dataframe to the expected format
val newdf1 = df1.withColumn("lastupdatedate",
  date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))

// join both dataframes on prodid and derive the columns of the expected output
val tempdf = newdf.as("table1").join(newdf1.as("table2"), Seq("prodid"), "outer")
  .select(
    col("prodid"),
    // start date: keep the original date when the product already existed,
    // otherwise take the date from the incoming file
    when(col("table1.lastupdatedate").isNotNull, col("table1.lastupdatedate"))
      .otherwise(col("table2.lastupdatedate"))
      .as("lastupdatedate"),
    // end date: an existing product that received an update is closed with
    // the update date; everything else stays open-ended (2400-01-01)
    when(col("table1.indicator").isNotNull,
      when(col("table2.lastupdatedate").isNotNull, col("table2.lastupdatedate"))
        .otherwise(lit("2400-01-01")))
      .otherwise(lit("2400-01-01"))
      .as("defaultdate"),
    // indicator: a U in the incoming file marks the old record inactive (I)
    when(col("table2.indicator").isNull, col("table1.indicator"))
      .otherwise(when(col("table2.indicator") === "U", lit("I"))
        .otherwise(col("table2.indicator")))
      .as("indicator"))

// for every closed (I) record, build the new active row that replaces it
val filtereddf = tempdf.filter(col("indicator") === "I")
  .withColumn("lastupdatedate", col("defaultdate"))
  .withColumn("defaultdate", lit("2400-01-01"))
  .withColumn("indicator", lit("A"))

// finally merge both dataframes
tempdf.union(filtereddf).sort("prodid", "lastupdatedate").show(false)
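As a side note, on Spark 2.2+ the same date conversion can be written more directly with to_date (a sketch, assuming the same MM-dd-yyyy input format; the rest of the pipeline is unchanged):

import org.apache.spark.sql.functions.{col, date_format, to_date, trim}

// to_date(column, format) is available from Spark 2.2 and returns a DateType
// column directly, so the unix_timestamp/cast round-trip is not needed
val newdfAlt = df.na.fill("01-01-1900", Seq("lastupdatedate"))
  .withColumn("lastupdatedate",
    date_format(to_date(trim(col("lastupdatedate")), "MM-dd-yyyy"), "yyyy-MM-dd"))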

The pipeline above should give you:

+------+--------------+-----------+---------+
|prodid|lastupdatedate|defaultdate|indicator|
+------+--------------+-----------+---------+
|1     |1900-01-01    |2400-01-01 |A        |
|2     |1981-01-25    |2018-01-25 |I        |
|2     |2018-01-25    |2400-01-01 |A        |
|3     |1982-01-26    |2400-01-01 |A        |
|4     |1985-12-20    |2018-01-25 |I        |
|4     |2018-01-25    |2400-01-01 |A        |
|6     |2018-01-25    |2400-01-01 |A        |
|8     |2018-01-25    |2400-01-01 |A        |
+------+--------------+-----------+---------+
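Note that prodid shows up without its leading zeros (1 instead of 00001) because inferSchema parses the column as an integer. If the zero-padded ids matter, read the column as a string instead, for example with an explicit schema (a sketch, assuming the same three-column file layout):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// an explicit schema keeps prodid as a string, so ids like 00001 keep
// their leading zeros
val schema = StructType(Seq(
  StructField("prodid", StringType),
  StructField("lastupdatedate", StringType),
  StructField("indicator", StringType)
))

val dfString = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("d:/prod.txt")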