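I have two tables, A and B, with the same schema. Every unique ID in table A also exists in table B, in a one-to-one relationship. I want to add a column to table B that holds the name of the column whose value differs between the two tables for that row. Each row has exactly one difference.
For example, table A:
{"id1": 1,"id2": "a","name": "bob","state": "nj"}
{"id1": 2,"id2": "b","name": "sue","state": "ma"}
and table B:
{"id1": 1,"id2": "a","name": "bob","state": "fl"}
{"id1": 2,"id2": "b","name": "susan","state": "ma"}
After comparing them, I want table B to look like this: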
{"id1": 1,"id2": "a","name": "bob","state": "fl", "changed_field": "state"}
{"id1": 2,"id2": "b","name": "susan","state": "ma", "changed_field": "name"}
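I couldn't find any function on Spark Scala DataFrames that does this. Is there something I'm missing?
Edit: I'm working with hundreds to thousands of columns.
Answer 0 (score: 2)
Here is one way to do this without having to "spell out" the columns and without a UDF (using only built-in functions):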
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
// list of columns to compare: every column except the first one (id1)
val comparableColumns = A.columns.tail
// create a Column that evaluates to the name of the first differing column.
// Note: =!= yields null when either side is null, so such a column is not flagged.
val changedFieldCol: Column = comparableColumns.foldLeft(lit("")) {
  case (result, col) => when(
    result === "", when($"A.$col" =!= $"B.$col", lit(col)).otherwise(lit(""))
  ).otherwise(result)
}
// join by id1, add changedFieldCol, and then select only B's columns:
val result = A.as("A").join(B.as("B"), "id1")
  .withColumn("changed_field", changedFieldCol)
  .select("id1", comparableColumns.map(c => s"B.$c") :+ "changed_field": _*)
result.show(false)
// +---+---+-----+-----+-------------+
// |id1|id2|name |state|changed_field|
// +---+---+-----+-----+-------------+
// |1 |a |bob |fl |state |
// |2 |b |susan|ma |name |
// +---+---+-----+-----+-------------+
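To make the fold easier to follow, here is a minimal plain-Scala model of what the nested when/otherwise expression computes for a single row. It does not use Spark: rows are modeled as Map[String, String], and the names FirstDiffSketch and firstChangedField are made up for this illustration. It also does not reproduce Spark's null semantics for =!=.

```scala
// Plain-Scala model of the fold above (no Spark involved).
// All names here are hypothetical and exist only for illustration.
object FirstDiffSketch {
  // Returns the name of the first column whose values differ, or "" if none do.
  def firstChangedField(columns: Seq[String],
                        a: Map[String, String],
                        b: Map[String, String]): String =
    columns.foldLeft("") { (result, col) =>
      // Mirrors when(result === "", ...).otherwise(result):
      // once a differing column has been found, later columns are ignored.
      if (result.isEmpty && a.get(col) != b.get(col)) col else result
    }

  def main(args: Array[String]): Unit = {
    val columns = Seq("id2", "name", "state")
    val a = Map("id2" -> "a", "name" -> "bob", "state" -> "nj")
    val b = Map("id2" -> "a", "name" -> "bob", "state" -> "fl")
    println(firstChangedField(columns, a, b)) // prints "state"
  }
}
```

Because the accumulator only changes while it is still empty, a row with several differences reports just the first one in column order, which matches the question's guarantee of one difference per row.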
Answer 1 (score: 1)
You can compare the fields in a UDF that produces the corresponding string:
import org.apache.spark.sql.functions.udf
import spark.implicits._
val df_a = Seq(
  (1, "a", "bob", "nj"),
  (2, "b", "sue", "ma")
).toDF("id1", "id2", "name", "state")
val df_b = Seq(
  (1, "a", "bob", "fl"),
  (2, "b", "susane", "ma")
).toDF("id1", "id2", "name", "state")
// returns a comma-separated list of the fields that differ between a and b
val compareFields = udf((aName: String, aState: String, bName: String, bState: String) => {
  val changedState = if (aState != bState) Some("state") else None
  val changedName = if (aName != bName) Some("name") else None
  Seq(changedName, changedState).flatten.mkString(",")
})
df_b.as("b")
  .join(
    df_a.as("a"), Seq("id1", "id2")
  )
  .withColumn("changed_fields", compareFields($"a.name", $"a.state", $"b.name", $"b.state"))
  .select($"id1", $"id2", $"b.name", $"b.state", $"changed_fields")
  .show()
which gives:
+---+---+------+-----+--------------+
|id1|id2| name|state|changed_fields|
+---+---+------+-----+--------------+
| 1| a| bob| fl| state|
| 2| b|susane| ma| name|
+---+---+------+-----+--------------+
Edit:
Here is a more general version that compares all fields at once:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}
// compares two rows with the same schema and lists every field that differs
val compareFields = udf((a: Row, b: Row) => {
  assert(a.schema == b.schema)
  a.schema
    .indices
    .filter(i => a.get(i) != b.get(i))
    .map(i => a.schema(i).name)
    .mkString(",")
})
df_b.as("b")
  .join(df_a.as("a"), $"a.id1" === $"b.id1" and $"a.id2" === $"b.id2")
  .withColumn("changed_fields", compareFields(struct($"a.*"), struct($"b.*")))
  .select($"b.id1", $"b.id2", $"b.name", $"b.state", $"changed_fields")
  .show()
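The body of that generic UDF can be tried out without a Spark session. The following is only a sketch in plain Scala, assuming a row can be stood in for by a Seq[Any] plus a separate list of field names. ChangedFieldsSketch and changedFields are hypothetical names, not part of Spark.

```scala
// Plain-Scala stand-in for the generic row comparison (no Spark involved).
// All names here are hypothetical and exist only for illustration.
object ChangedFieldsSketch {
  // Compares two rows position by position and returns the names of
  // all differing fields, joined with commas.
  def changedFields(names: Seq[String], a: Seq[Any], b: Seq[Any]): String = {
    require(a.length == names.length && b.length == names.length,
      "both rows must have one value per field name")
    names.indices
      .collect { case i if a(i) != b(i) => names(i) } // != on Any is null-safe in Scala
      .mkString(",")
  }

  def main(args: Array[String]): Unit = {
    val names = Seq("id1", "id2", "name", "state")
    val a = Seq[Any](2, "b", "sue", "ma")
    val b = Seq[Any](2, "b", "susane", "fl")
    println(changedFields(names, a, b)) // prints "name,state"
  }
}
```

Unlike the first-difference approach, this reports every differing field, so a row with two changes yields both names.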