Appending two DataFrames and updating the data

Time: 2017-10-25 09:23:50

Tags: scala apache-spark dataframe

Hi everyone. I want to update an old DataFrame based on the pos_id and article_id fields. If the tuple (pos_id, article_id) already exists, I add each column's value to the corresponding old one; if it does not, I append the row as a new one. This works fine. But I don't know how to handle the case where the old DataFrame is empty: in that case I would simply add the rows from the second DataFrame to the old one. Here is what I did:

    import org.apache.spark.sql.types.{DateType, DoubleType, LongType}
    import spark.implicits._

    val histocaisse = spark.read
      .format("csv")
      .option("header", "true") // read the header row
      .load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")

    val hist1 = histocaisse
      .withColumn("pos_id", 'pos_id.cast(LongType))
      .withColumn("article_id", 'article_id.cast(LongType))
      .withColumn("date", 'date.cast(DateType))
      .withColumn("qte", 'qte.cast(DoubleType))
      .withColumn("ca", 'ca.cast(DoubleType))


    val histocaisse2 = spark.read
      .format("csv")
      .option("header", "true") // read the header row
      .load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")

    val hist2 = histocaisse2
      .withColumn("pos_id", 'pos_id.cast(LongType))
      .withColumn("article_id", 'article_id.cast(LongType))
      .withColumn("date", 'date.cast(DateType))
      .withColumn("qte", 'qte.cast(DoubleType))
      .withColumn("ca", 'ca.cast(DoubleType))

    hist1.show(false)
    hist2.show(false)

These are hist1, hist2, and the merged result I am aiming for:

+------+----------+----------+----+----+
|pos_id|article_id|date      |qte |ca  |
+------+----------+----------+----+----+
|1     |1         |2000-01-07|2.5 |3.5 |
|2     |2         |2000-01-07|14.7|12.0|
|3     |3         |2000-01-07|3.5 |1.2 |
+------+----------+----------+----+----+

+------+----------+----------+----+----+
|pos_id|article_id|date      |qte |ca  |
+------+----------+----------+----+----+
|1     |1         |2000-01-08|2.5 |3.5 |
|2     |2         |2000-01-08|14.7|12.0|
|3     |3         |2000-01-08|3.5 |1.2 |
|4     |4         |2000-01-08|3.5 |1.2 |
|5     |5         |2000-01-08|14.5|1.2 |
|6     |6         |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+

+------+----------+----------+----+----+
|pos_id|article_id|date      |qte |ca  |
+------+----------+----------+----+----+
|1     |1         |2000-01-08|5.0 |7.0 |
|2     |2         |2000-01-08|39.4|24.0|
|3     |3         |2000-01-08|7.0 |2.4 |
|4     |4         |2000-01-08|3.5 |1.2 |
|5     |5         |2000-01-08|14.5|1.2 |
|6     |6         |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
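
Side note: the two cast blocks above are identical, so a small helper (my own sketch, not part of the original code) keeps the casts in one place:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{DateType, DoubleType, LongType}

    // Hypothetical helper: apply the same column casts to any of the raw CSV DataFrames.
    def castHisto(raw: DataFrame): DataFrame =
      raw
        .withColumn("pos_id", col("pos_id").cast(LongType))
        .withColumn("article_id", col("article_id").cast(LongType))
        .withColumn("date", col("date").cast(DateType))
        .withColumn("qte", col("qte").cast(DoubleType))
        .withColumn("ca", col("ca").cast(DoubleType))

    val hist1 = castHisto(histocaisse)
    val hist2 = castHisto(histocaisse2)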

Here is the solution I found:

    val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
      .select($"pos_id", $"article_id",
        coalesce(hist2("date"), hist1("date")).alias("date"),
        (coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
        (coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
      .orderBy("pos_id", "article_id")

This does not work when hist1 is empty. Can anyone help? Thanks a lot.
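
For reference, a minimal sketch of the empty case; the explicit schema below is my own reconstruction of the columns shown above, and it can be used to build an empty hist1 to test against:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // Reconstructed schema matching the columns used above (assumption).
    val histoSchema = StructType(Seq(
      StructField("pos_id", LongType),
      StructField("article_id", LongType),
      StructField("date", DateType),
      StructField("qte", DoubleType),
      StructField("ca", DoubleType)
    ))

    // An empty "old" DataFrame with that schema, to exercise the join above.
    val emptyHist1 = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], histoSchema)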

2 answers:

Answer 0 (score: 0)

Not sure if I understood correctly, but if the problem is that the second DataFrame is sometimes empty and that makes the join crash, then you could try something like this:

    import org.apache.spark.sql.functions.{coalesce, lit}
    import scala.util.{Failure, Success, Try}

    val checkHist1Empty = Try(hist1.first)

    val df = checkHist1Empty match {
      case Success(_) =>
        hist2.join(hist1, Seq("article_id", "pos_id"), "left")
          .select($"pos_id", $"article_id",
            coalesce(hist2("date"), hist1("date")).alias("date"),
            (coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
            (coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
          .orderBy("pos_id", "article_id")
      case Failure(_) =>
        hist2.select($"pos_id", $"article_id",
          hist2("date").alias("date"),
          coalesce(hist2("qte"), lit(0)).alias("qte"),
          coalesce(hist2("ca"), lit(0)).alias("ca"))
          .orderBy("pos_id", "article_id")
    }

This basically checks whether hist1 is empty before doing the join. If it is empty, it builds the resulting df with the same logic, but applied only to the hist2 DataFrame. If it does contain data, it applies the logic you already have, which you said works.
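
A small variation on the same idea (my own sketch, not part of the original answer): the emptiness check can be written without a Try, using head(1) (or Dataset.isEmpty on Spark 2.4 and later):

    import org.apache.spark.sql.functions.{coalesce, lit}

    // head(1) returns an empty Array exactly when hist1 has no rows.
    val df =
      if (hist1.head(1).isEmpty) {
        // Old data is empty: keep only the new snapshot, with the same column shape.
        hist2.select($"pos_id", $"article_id",
          hist2("date").alias("date"),
          coalesce(hist2("qte"), lit(0)).alias("qte"),
          coalesce(hist2("ca"), lit(0)).alias("ca"))
          .orderBy("pos_id", "article_id")
      } else {
        // Old data exists: same join-and-sum logic as in the question.
        hist2.join(hist1, Seq("article_id", "pos_id"), "left")
          .select($"pos_id", $"article_id",
            coalesce(hist2("date"), hist1("date")).alias("date"),
            (coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
            (coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
          .orderBy("pos_id", "article_id")
      }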

Answer 1 (score: 0)

Instead of doing a join, why not union the two DataFrames, then do a groupBy on (pos_id, article_id) and aggregate, summing the qte and ca columns and taking the latest date:

    val df3 = hist1.union(hist2) // unionAll is deprecated on Spark 2.x
    val df4 = df3.groupBy("pos_id", "article_id")
      .agg(max("date").as("date"), sum("qte").as("qte"), sum("ca").as("ca"))
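
For completeness, a self-contained sketch of this union-and-aggregate approach (the tiny inline datasets are illustrative stand-ins, not the CSV files from the question); note that it also covers the empty-old-data case, since a union with an empty DataFrame simply yields the other side:

    import org.apache.spark.sql.functions.{max, sum}
    import spark.implicits._

    // Illustrative stand-ins for the old and new snapshots (dates kept as strings for brevity).
    val oldHist = Seq((1L, 1L, "2000-01-07", 2.5, 3.5))
      .toDF("pos_id", "article_id", "date", "qte", "ca")
    val newHist = Seq(
      (1L, 1L, "2000-01-08", 2.5, 3.5),
      (4L, 4L, "2000-01-08", 3.5, 1.2)
    ).toDF("pos_id", "article_id", "date", "qte", "ca")

    val merged = oldHist.union(newHist)
      .groupBy("pos_id", "article_id")
      .agg(max("date").as("date"), sum("qte").as("qte"), sum("ca").as("ca"))
      .orderBy("pos_id", "article_id")

    merged.show(false)
    // (1,1) -> qte 5.0, ca 7.0  (summed across both snapshots)
    // (4,4) -> qte 3.5, ca 1.2  (only present in the new snapshot)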