Table A:
------------------------
col1|col2|col3|col4|
------------------------
A |123 |d1 |d2 |
------------------------
A |134 |d3 |d4 |
------------------------
B |156 |d5 |d6 |
------------------------
B |178 |d7 |d8 |
------------------------
Table B:
----------
col1|col2|
----------
A |129 |
----------
A |147 |
----------
B |199 |
----------
B |175 |
----------
I am trying to create a lookup in a Spark UDF that uses col1 and col2 to look up values from Table A, fetching the remaining columns for each row of Table B under the condition tableA.col1 = tableB.col1 and tableA.col2 <= tableB.col2.
Output:
------------------------
col1|col2|col3|col4|
------------------------
A |129 |d1 |d2 |
------------------------
A |147 |d3 |d4 |
------------------------
B |199 |d7 |d8 |
------------------------
B |175 |d5 |d6 |
------------------------
Here is what I have so far. It works for the equality part, but I am not sure how to handle the less-than-or-equal condition.
// get values
val values = df.select("col1", "col2").map(r => r.toString()).collect.toList
// get keys
val keys = enriched_2080.select($"col3", $"col4").map(r => (r.getString(0), r.getLong(1))).collect.toList
// create a map
val lookup_map = keys.zip(values).toMap
// udf
val lookup_udf = udf { (a: String, b: Long) =>
  lookup_map.getOrElse((a, b), "")
}
// call udf
df1.withColumn("result", lookup_udf(df1("col1"), df1("col2"))).show(false)
Answer 0 (score: 1)
Is there any reason to use a lookup UDF that requires collecting the DataFrame data, thereby placing a constraint on the size of the DataFrames?

Edit:

If your goal is simply to generate the wanted output DataFrame, the following approach does not impose unnecessary constraints on DataFrame size:
val dfA = Seq(
  ("A", 123L, "d1", "d2"),
  ("A", 134L, "d3", "d4"),
  ("B", 156L, "d5", "d6"),
  ("B", 178L, "d7", "d8")
).toDF("col1", "col2", "col3", "col4")

val dfB = Seq(
  ("A", 129L),
  ("A", 147L),
  ("B", 199L),
  ("B", 175L)
).toDF("col1", "col2")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// Create a DataFrame with all `dfB.col2 - dfA.col2` values that are >= 0
val dfDiff = dfA.join(dfB, Seq("col1")).
  select(
    dfA("col1"), dfA("col2").as("col2a"), dfB("col2").as("col2b"),
    dfA("col3"), dfA("col4"), (dfB("col2") - dfA("col2")).as("diff")
  ).
  where($"diff" >= 0)
dfDiff.show
// +----+-----+-----+----+----+----+
// |col1|col2a|col2b|col3|col4|diff|
// +----+-----+-----+----+----+----+
// | A| 123| 147| d1| d2| 24|
// | A| 123| 129| d1| d2| 6|
// | A| 134| 147| d3| d4| 13|
// | B| 156| 175| d5| d6| 19|
// | B| 156| 199| d5| d6| 43|
// | B| 178| 199| d7| d8| 21|
// +----+-----+-----+----+----+----+
// Create result dataset with minimum `diff` for every `(col1, col2)` in dfA
// and assign corresponding `dfB.col2` as the new `col2`
val dfResult = dfDiff.withColumn("rank",
    rank.over(Window.partitionBy($"col1", $"col2a").orderBy($"diff"))
  ).
  where($"rank" === 1).
  select($"col1", $"col2b".as("col2"), $"col3", $"col4")
dfResult.show
// +----+----+----+----+
// |col1|col2|col3|col4|
// +----+----+----+----+
// | A| 147| d3| d4|
// | B| 175| d5| d6|
// | A| 129| d1| d2|
// | B| 199| d7| d8|
// +----+----+----+----+
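For reference, the matching logic the join + window implements can be sketched in plain Scala collections (no Spark), which makes the intended semantics easy to check against the sample data. This is a minimal sketch, assuming the rows are hard-coded tuples mirroring Table A and Table B; `LookupSketch` is a hypothetical name, not part of the question's code:

```scala
// Plain-Scala sketch of the same matching rule, assuming hard-coded sample rows.
object LookupSketch {
  val tableA = Seq(
    ("A", 123L, "d1", "d2"),
    ("A", 134L, "d3", "d4"),
    ("B", 156L, "d5", "d6"),
    ("B", 178L, "d7", "d8")
  )
  val tableB = Seq(("A", 129L), ("A", 147L), ("B", 199L), ("B", 175L))

  // For each Table A row, pick the Table B row with the same col1 and the
  // smallest col2 that is still >= tableA.col2 (i.e. the minimum non-negative diff).
  val result: Seq[(String, Long, String, String)] =
    tableA.flatMap { case (k, a2, c3, c4) =>
      tableB
        .collect { case (`k`, b2) if b2 >= a2 => b2 }
        .sorted
        .headOption
        .map(b2 => (k, b2, c3, c4))
    }
}
```

Running this over the sample rows reproduces the four output rows above; the Spark version expresses the same rule with a join filtered to `diff >= 0` and a `rank` window taking the minimum diff per `(col1, col2a)` partition, which scales beyond driver memory.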