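Scala: How to add a column containing the name of the field that changed between two tables

Asked: 2017-12-19 19:21:08

Tags: scala apache-spark dataframe apache-spark-sql

I have two tables (A and B) with the same schema, where every unique ID in table A also exists in table B on a 1-to-1 basis. I want to add a column to table B that holds, for each row, the name of the column whose value differs between the two tables. Each row has exactly one difference.

For example:

Table A:

{"id1": 1,"id2": "a","name": "bob","state": "nj"}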

{"id1": 2,"id2": "b","name": "sue","state": "ma"}

Table B:

{"id1": 1,"id2": "a","name": "bob","state": "fl"}

{"id1": 2,"id2": "b","name": "susan","state": "ma"}

After comparing them, I would like table B to look like this:

{"id1": 1,"id2": "a","name": "bob","state": "fl", "changed_field": "state"}

{"id1": 2,"id2": "b","name": "susan","state": "ma", "changed_field": "name"}

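I could not find any function in Spark's Scala DataFrame API that does this. Is there something I missed?

Edit: I am working with hundreds to thousands of columns.

2 Answers:

Answer 0: (score: 2)

Here is a way to achieve this without having to "spell out" the columns and without a UDF (using only built-in functions):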

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

// list of columns to compare
val comparableColumns = A.columns.tail // drop the first column (id1), which is the join key

// create Column that would result in the name of the first differing column:
val changedFieldCol: Column = comparableColumns.foldLeft(lit("")) {
  case (result, col) => when(
    result === "", when($"A.$col" =!= $"B.$col", lit(col)).otherwise(lit(""))
  ).otherwise(result)
}

// join by id1, add changedFieldCol, and then select only B's columns:
val result = A.as("A").join(B.as("B"), "id1")
  .withColumn("changed_field", changedFieldCol)
  .select("id1", comparableColumns.map(c => s"B.$c") :+ "changed_field": _*)

result.show(false)
// +---+---+-----+-----+-------------+
// |id1|id2|name |state|changed_field|
// +---+---+-----+-----+-------------+
// |1  |a  |bob  |fl   |state        |
// |2  |b  |susan|ma   |name         |
// +---+---+-----+-----+-------------+
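
Two properties of the fold above are worth keeping in mind: it stops at the first differing column, and `=!=` evaluates to null when either side is null, so a change to or from null is not reported. The following is a minimal sketch of a variant, not the answerer's code, that lists every differing column and uses the null-safe equality operator `<=>`. The helper name `withChangedFields`, the output column `changed_fields`, and the local `SparkSession` are assumptions made for illustration. It relies on `when` without `otherwise` yielding null and on `concat_ws` skipping null arguments.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, lit, not, when}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("changed-fields").getOrCreate()
import spark.implicits._

// Hypothetical helper: returns b's columns plus a comma-separated list of changed column names.
def withChangedFields(a: DataFrame, b: DataFrame, key: String): DataFrame = {
  val comparable: Seq[String] = a.columns.filterNot(_ == key).toSeq

  // One expression per column: the column name if the values differ, otherwise null.
  // `<=>` treats two nulls as equal and never returns null itself.
  val changed: Seq[Column] = comparable.map { c =>
    when(not(col(s"A.$c") <=> col(s"B.$c")), lit(c))
  }

  val output: Seq[Column] =
    (col(key) +: comparable.map(c => col(s"B.$c"))) :+ col("changed_fields")

  a.as("A").join(b.as("B"), key)
    .withColumn("changed_fields", concat_ws(",", changed: _*)) // null entries are skipped
    .select(output: _*)
}

// Sample data taken from the question.
val dfA = Seq((1, "a", "bob", "nj"), (2, "b", "sue", "ma")).toDF("id1", "id2", "name", "state")
val dfB = Seq((1, "a", "bob", "fl"), (2, "b", "susan", "ma")).toDF("id1", "id2", "name", "state")

withChangedFields(dfA, dfB, "id1").show(false)
```

Because each column contributes one flat expression rather than one more level of nested `when`, this shape may also be easier on the query planner when there are thousands of columns, though that has not been measured here. Column names containing dots or other special characters would need backtick quoting in the `col(...)` references.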

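Answer 1: (score: 1)

You can compare the fields inside a UDF that builds the corresponding string:

import org.apache.spark.sql.functions.udf
import spark.implicits._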

val df_a = Seq(
  (1, "a", "bob", "nj"),
  (2, "b", "sue", "ma")
).toDF("id1", "id2", "name", "state")

val df_b = Seq(
  (1, "a", "bob", "fl"),
  (2, "b", "susane", "ma")
).toDF("id1", "id2", "name", "state")

// returns the names of the changed fields, comma-separated
val compareFields = udf((aName: String, aState: String, bName: String, bState: String) => {
  val changedState = if (aState != bState) Some("state") else None
  val changedName = if (aName != bName) Some("name") else None
  Seq(changedName, changedState).flatten.mkString(",")
})

df_b.as("b")
  .join(df_a.as("a"), Seq("id1", "id2"))
  .withColumn("changed_fields", compareFields($"a.name", $"a.state", $"b.name", $"b.state"))
  .select($"id1", $"id2", $"b.name", $"b.state", $"changed_fields")
  .show()

This gives:

+---+---+------+-----+--------------+
|id1|id2|  name|state|changed_fields|
+---+---+------+-----+--------------+
|  1|  a|   bob|   fl|         state|
|  2|  b|susane|   ma|          name|
+---+---+------+-----+--------------+

Edit:

Here is a more general version that compares all fields at once:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}

// compare two rows field by field and list the names of the fields that differ
val compareFields = udf((a: Row, b: Row) => {
  assert(a.schema == b.schema)
  a.schema.indices
    .filter(i => a.get(i) != b.get(i))
    .map(i => a.schema(i).name)
    .mkString(",")
})

df_b.as("b")
  .join(df_a.as("a"), $"a.id1" === $"b.id1" and $"a.id2" === $"b.id2")
  .withColumn("changed_fields", compareFields(struct($"a.*"), struct($"b.*")))
  .select($"b.id1", $"b.id2", $"b.name", $"b.state", $"changed_fields")
  .show()
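
The comparison performed inside the generic UDF can be tried out on plain Scala collections, without a Spark session, which is convenient for unit-testing the logic. This is only a sketch: the helper `changedFields` and its parallel-sequence representation of a row are hypothetical stand-ins for the `Row` and schema used above.

```scala
// Hypothetical helper mirroring the UDF body: `names`, `a` and `b` are parallel sequences.
def changedFields(names: Seq[String], a: Seq[Any], b: Seq[Any]): String = {
  require(names.length == a.length && a.length == b.length, "rows must have the same arity")
  names.indices
    .filter(i => a(i) != b(i)) // `!=` on Any is null-safe in Scala
    .map(i => names(i))
    .mkString(",")
}

val names = Seq("id1", "id2", "name", "state")
val rowA = Seq[Any](2, "b", "sue", "ma")
val rowB = Seq[Any](2, "b", "susane", "ma")

println(changedFields(names, rowA, rowB)) // prints "name"
```

If more than one field differs, the names are joined with commas in schema order.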