Spark lookup returning values less than or equal

Date: 2018-04-06 01:52:42

Tags: apache-spark apache-spark-sql spark-dataframe

Table A:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|A   |123 |d1  |d2  |
|A   |134 |d3  |d4  |
|B   |156 |d5  |d6  |
|B   |178 |d7  |d8  |
+----+----+----+----+

Table B:
+----+----+
|col1|col2|
+----+----+
|A   |129 |
|A   |147 |
|B   |199 |
|B   |175 |
+----+----+

I am trying to create a lookup in a Spark UDF that matches rows of table A against table B on the condition tableA.col1 = tableB.col1 AND tableA.col2 <= tableB.col2, keeping the remaining columns from A. As the output below shows, each A row should take the closest (smallest) qualifying col2 from B.

Output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|A   |129 |d1  |d2  |
|A   |147 |d3  |d4  |
|B   |199 |d7  |d8  |
|B   |175 |d5  |d6  |
+----+----+----+----+
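
In DataFrame terms the match being asked for is a non-equi join; a minimal sketch of just the condition, with dfA and dfB standing in for tables A and B:

// all B rows sharing col1 whose col2 is >= the A row's col2
dfA.join(dfB, dfA("col1") === dfB("col1") && dfA("col2") <= dfB("col2"))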

Here is what I have done so far. It works for an exact match, but I am not sure how to express the less-than-or-equal condition.

// imports needed for $-columns, Dataset.map and udf
import spark.implicits._
import org.apache.spark.sql.functions.udf

// keys: the (col1, col2) pairs to look up by
val keys = df.select($"col1", $"col2").map(r => (r.getString(0), r.getLong(1))).collect.toList

// values: the remaining columns (col3, col4), one string per row
val values = df.select($"col3", $"col4").map(r => r.toString()).collect.toList

// create a map keyed on (col1, col2)
val lookup_map = keys.zip(values).toMap

// udf: exact-match lookup
val lookup_udf = udf { (a: String, b: Long) => lookup_map.getOrElse((a, b), "") }

// call udf
df1.withColumn("result", lookup_udf(df1("col1"), df1("col2"))).show(false)

1 Answer:

Answer 0 (score: 1)

Is there any reason you want to use a lookup UDF, which requires the DataFrame data to be collected and hence puts a constraint on the size of your DataFrames?
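
If the tables really are small enough to collect, one way to handle the <= condition inside a UDF is to keep, per col1 key, a sorted list of table B's col2 values and pick the smallest one that qualifies. A minimal sketch under that assumption, using the dfA/dfB DataFrames defined below:

import org.apache.spark.sql.functions.udf

// collect table B's col2 values per col1 key -- only viable for small tables
val bVals: Map[String, Array[Long]] =
  dfB.collect
    .map(r => (r.getString(0), r.getLong(1)))
    .groupBy(_._1)
    .map { case (k, pairs) => k -> pairs.map(_._2).sorted }

// pick the smallest B.col2 that is >= the given A.col2 (None when nothing matches)
val leqLookupUdf = udf { (k: String, v: Long) =>
  bVals.getOrElse(k, Array.empty[Long]).find(_ >= v)
}

dfA.withColumn("col2", leqLookupUdf($"col1", $"col2")).show(false)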

If your goal is simply to generate the wanted output DataFrame, the following join-based approach imposes no such size constraint:

import spark.implicits._  // for toDF and $-columns (already in scope in spark-shell)

val dfA = Seq(
  ("A", 123L, "d1", "d2"),
  ("A", 134L, "d3", "d4"),
  ("B", 156L, "d5", "d6"),
  ("B", 178L, "d7", "d8")
).toDF("col1", "col2", "col3", "col4")

val dfB = Seq(
  ("A", 129L),
  ("A", 147L),
  ("B", 199L),
  ("B", 175L)
).toDF("col1", "col2")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Create a DataFrame with all `dfB.col2 - dfA.col2` values that are >= 0
val dfDiff = dfA.join(dfB, Seq("col1")).
  select(
    dfA("col1"), dfA("col2").as("col2a"), dfB("col2").as("col2b"),
    dfA("col3"), dfA("col4"), (dfB("col2") - dfA("col2")).as("diff")
  ).
  where($"diff" >= 0)

dfDiff.show
// +----+-----+-----+----+----+----+
// |col1|col2a|col2b|col3|col4|diff|
// +----+-----+-----+----+----+----+
// |   A|  123|  147|  d1|  d2|  24|
// |   A|  123|  129|  d1|  d2|   6|
// |   A|  134|  147|  d3|  d4|  13|
// |   B|  156|  175|  d5|  d6|  19|
// |   B|  156|  199|  d5|  d6|  43|
// |   B|  178|  199|  d7|  d8|  21|
// +----+-----+-----+----+----+----+

// Create result dataset with minimum `diff` for every `(col1, col2)` in dfA
// and assign corresponding `dfB.col2` as the new `col2`
val dfResult = dfDiff.withColumn( "rank",
    rank.over(Window.partitionBy($"col1", $"col2a").orderBy($"diff"))
  ).
  where($"rank" === 1).
  select( $"col1", $"col2b".as("col2"), $"col3", $"col4" )

dfResult.show
// +----+----+----+----+
// |col1|col2|col3|col4|
// +----+----+----+----+
// |   A| 147|  d3|  d4|
// |   B| 175|  d5|  d6|
// |   A| 129|  d1|  d2|
// |   B| 199|  d7|  d8|
// +----+----+----+----+
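
For comparison, the same ranking logic can be written as a plain non-equi join in Spark SQL; a sketch assuming dfA and dfB are registered as temp views:

dfA.createOrReplaceTempView("tableA")
dfB.createOrReplaceTempView("tableB")

spark.sql("""
  SELECT col1, col2, col3, col4
  FROM (
    SELECT a.col1, b.col2, a.col3, a.col4,
           RANK() OVER (PARTITION BY a.col1, a.col2 ORDER BY b.col2 - a.col2) AS rnk
    FROM tableA a
    JOIN tableB b
      ON a.col1 = b.col1 AND a.col2 <= b.col2
  ) t
  WHERE rnk = 1
""").show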