Iterating over columns in a dataframe

Time: 2018-04-24 05:48:08

Tags: scala apache-spark dataframe spark-dataframe

I have the following dataframe DF1

+----------+----+----+----+-----+
|      WEEK|DIM1|DIM2|  T1|   T2|
+----------+----+----+----+-----+
|2016-04-02|  14|NULL|9874|  880|
|2016-04-30|  14|  FR|9875|   13|
|2017-06-10|  15| PQR|9867|57721|
+----------+----+----+----+-----+

DF2

+----------+----+----+----+-----+
|      WEEK|DIM1|DIM2|  T1|   T2|
+----------+----+----+----+-----+
|2016-04-02|  14|NULL|9879|  820|
|2016-04-30|  14|  FR|9785|    9|
|2017-06-10|  15| XYZ|9967|57771|
+----------+----+----+----+-----+

I need to generate the output as below -

+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
|      WEEK|DIM1|DIM2|  T1|   T2|  T1|   T2|t1_diff|t2_diff|pr_primary|pr_reference|
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
|2016-04-02|  14|NULL|9874|  880|9879|  820|     -5|     60|         Y|           Y|
|2017-06-10|  15| PQR|9867|57721|null| null|   null|   null|         Y|           N|
|2017-06-10|  15| XYZ|null| null|9967|57771|   null|   null|         N|           Y|
|2016-04-30|  14|  FR|9875|   13|9785|    9|     90|      4|         Y|           Y|
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+

Here, t1_diff is the difference between the left T1 and the right T1, and t2_diff is the difference between the left T2 and the right T2; pr_primary is Y when the row is present in df1 (and N when it only comes from df2), and pr_reference works the same way for df2. I generated the output above with the following code:

import spark.implicits._ // for toDF on local Seqs (already in scope in spark-shell)

val df1 = Seq(
  ("2016-04-02", "14", "NULL", 9874, 880), ("2016-04-30", "14", "FR", 9875, 13), ("2017-06-10", "15", "PQR", 9867, 57721)
).toDF("WEEK", "DIM1", "DIM2","T1","T2")

val df2 = Seq(
  ("2016-04-02", "14", "NULL", 9879, 820), ("2016-04-30", "14", "FR", 9785, 9), ("2017-06-10", "15", "XYZ", 9967, 57771)
).toDF("WEEK", "DIM1", "DIM2","T1","T2")

import org.apache.spark.sql.functions._

val joined = df1.as("l").join(df2.as("r"), Seq("WEEK", "DIM1", "DIM2"), "fullouter")

val j1 = joined
  .withColumn("t1_diff", col("l.T1") - col("r.T1"))
  .withColumn("t2_diff", col("l.T2") - col("r.T2"))

// returns "Y" if at least one of the two values is non-null, i.e. the row exists on that side of the join
val isPresentSubstitution = udf( (x: String, y: String) => if (x == null && y == null) "N" else "Y")

j1.withColumn("pr_primary", isPresentSubstitution(col("l.T1"), col("l.T2")))
  .withColumn("pr_reference", isPresentSubstitution(col("r.T1"), col("r.T2")))
  .show

I would like to generalize this to an arbitrary number of columns instead of just T1 and T2. Can someone suggest a better approach? I am running this on Spark.

3 Answers:

Answer 0 (score: 1)

To be able to define an arbitrary number of columns such as t1_diff, each with an arbitrary expression computing its value, some refactoring is needed so that withColumn can be applied in a more generic way.

First, we need to collect the targets: the names of the target columns and the expressions that compute their contents. This can be done with a sequence of tuples:

val diffColumns = Seq(
  ("t1_diff", col("l.T1") - col("r.T1")),
  ("t2_diff", col("l.T2") - col("r.T2"))
)
// or, to make it more readable, create a dedicated "case class DiffColumn(colName: String, expression: Column)"

Now we can use a fold over the sequence above, starting from the joined dataframe, to produce the result:

val joinedWithDiffCols = 
  diffColumns.foldLeft(joined) { case(df, diffTuple) =>
    df.withColumn(diffTuple._1, diffTuple._2)
  }

joinedWithDiffCols contains the same data as j1 in the question.
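As an aside, here is a minimal sketch of the case-class variant suggested in the comment above, with the fold adapted to destructure the case class (DiffColumn is the name proposed in that comment; diffColumns2 and joinedWithDiffCols2 are placeholder names for this sketch only):

import org.apache.spark.sql.Column

case class DiffColumn(colName: String, expression: Column)

val diffColumns2 = Seq(
  DiffColumn("t1_diff", col("l.T1") - col("r.T1")),
  DiffColumn("t2_diff", col("l.T2") - col("r.T2"))
)

// same fold as above, but pattern-matching on the case class instead of tuple accessors
val joinedWithDiffCols2 = diffColumns2.foldLeft(joined) { case (df, DiffColumn(name, expr)) =>
  df.withColumn(name, expr)
}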

To append further columns, you now only have to modify the diffColumns sequence. You could even put the computation of pr_primary and pr_reference into this sequence (although in that case it would be more accurate to rename the val to something like appendedColumns).
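For example, a sketch of what that could look like, reusing the isPresentSubstitution UDF from the question (appendedColumns here is just the suggested rename, not code from the original answer):

val appendedColumns = Seq(
  ("t1_diff",      col("l.T1") - col("r.T1")),
  ("t2_diff",      col("l.T2") - col("r.T2")),
  ("pr_primary",   isPresentSubstitution(col("l.T1"), col("l.T2"))),
  ("pr_reference", isPresentSubstitution(col("r.T1"), col("r.T2")))
)

val result = appendedColumns.foldLeft(joined) { case (df, (name, expr)) =>
  df.withColumn(name, expr)
}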

UPDATE

To make it easier to build the tuples in diffColumns, the construction can be generalized as well, for example:

// when both column names are same:
def generateDiff(column: String): (String, Column) = generateDiff(column, column)

// when left and right column names are different:
def generateDiff(leftCol: String, rightCol: String): (String, Column) =
  (s"${leftCol}_diff", col("l." + leftCol) - col("r." + rightCol))

val diffColumns = Seq("T1", "T2").map(generateDiff)
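Put together with the fold from earlier, the whole generalized step could then read as follows (just a restatement of the pieces above; it would replace the earlier definition of joinedWithDiffCols):

val joinedWithDiffCols =
  Seq("T1", "T2")
    .map(generateDiff)
    .foldLeft(joined) { case (df, (name, expr)) => df.withColumn(name, expr) }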

END OF UPDATE

Answer 1 (score: 0)

Assuming the column names you want to diff are the same in df1 and df2, you can do something like:

val diffCols = df1.columns
  .filter(_.matches("T\\d+"))
  .map(c => col(s"l.$c") - col(s"r.$c") as (s"${c.toLowerCase}_diff"))

Then use it with joined as:

joined.select((col("*") +: diffCols): _*).show(false)
//+----------+----+----+----+-----+----+-----+-------+-------+
//|WEEK      |DIM1|DIM2|T1  |T2   |T1  |T2   |t1_diff|t2_diff|
//+----------+----+----+----+-----+----+-----+-------+-------+
//|2016-04-02|14  |NULL|9874|880  |9879|820  |-5     |60     |
//|2017-06-10|15  |PQR |9867|57721|null|null |null   |null   |
//|2017-06-10|15  |XYZ |null|null |9967|57771|null   |null   |
//|2016-04-30|14  |FR  |9875|13   |9785|9    |90     |4      |
//+----------+----+----+----+-----+----+-----+-------+-------+
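If the presence flags are needed as well, they can be built in the same generic way. A sketch, under the assumption that a null measure column on one side means the row was missing from that side (measureCols and presenceCols are names introduced here, not from the original answer):

val measureCols = df1.columns.filter(_.matches("T\\d+"))

val presenceCols = Seq(
  when(measureCols.map(c => col(s"l.$c").isNotNull).reduce(_ || _), "Y").otherwise("N") as "pr_primary",
  when(measureCols.map(c => col(s"r.$c").isNotNull).reduce(_ || _), "Y").otherwise("N") as "pr_reference"
)

joined.select((col("*") +: (diffCols ++ presenceCols)): _*).show(false)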

Answer 2 (score: -1)

You can achieve this by adding a sequence number to each dataframe and then joining the two dataframes on that sequence number.

val df3 = df1.withColumn("SeqNum", monotonicallyIncreasingId)
val df4 = df2.withColumn("SeqNum", monotonicallyIncreasingId)

df3.as("l").join(df4.as("r"), "SeqNum")
  .withColumn("t1_diff", col("l.T1") - col("r.T1"))
  .withColumn("t2_diff", col("l.T2") - col("r.T2"))
  .drop("SeqNum")
  .show()
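To generalize this answer beyond the hard-coded T1 and T2, the withColumn calls could likewise be driven by the column names. A sketch (measureCols and result are names introduced here, and it assumes the measure columns all match the pattern T followed by digits):

val measureCols = df1.columns.filter(_.matches("T\\d+"))

val result = measureCols
  .foldLeft(df3.as("l").join(df4.as("r"), "SeqNum")) { (df, c) =>
    df.withColumn(s"${c.toLowerCase}_diff", col(s"l.$c") - col(s"r.$c"))
  }
  .drop("SeqNum")

result.show()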