I have the following two DataFrames in Spark 2.2.0 and Scala 2.11.8.
df1 =
+----------+-------------------------------+
|item | other_items |
+----------+-------------------------------+
| 111 |[[444,1.0],[333,0.5],[666,0.4]]|
| 222 |[[444,1.0],[333,0.5]] |
| 333 |[] |
| 444 |[[111,2.0],[555,0.5],[777,0.2]]|
+----------+-------------------------------+
printSchema gives the following output:
|-- item: string (nullable = true)
|-- other_items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- item: string (nullable = true)
| | |-- rank: double (nullable = true)
and
df2 =
+----------+-------------+
|itemA | itemB |
+----------+-------------+
| 111 | 333 |
| 222 | 444 |
| 333 | 555 |
| 444 | 777 |
+----------+-------------+
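For reference, here is a minimal sketch that recreates this sample data; the SparkSession value spark and the helper case class Rel are assumptions for illustration, not part of the original question:

// Hypothetical setup; assumes an existing SparkSession named spark
import spark.implicits._

case class Rel(item: String, rank: Double) // mirrors the struct inside other_items

val df1 = Seq(
  ("111", Seq(Rel("444", 1.0), Rel("333", 0.5), Rel("666", 0.4))),
  ("222", Seq(Rel("444", 1.0), Rel("333", 0.5))),
  ("333", Seq.empty[Rel]),
  ("444", Seq(Rel("111", 2.0), Rel("555", 0.5), Rel("777", 0.2)))
).toDF("item", "other_items")

val df2 = Seq(("111", "333"), ("222", "444"), ("333", "555"), ("444", "777"))
  .toDF("itemA", "itemB")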
For each pair in df2 I want to find the rank from df1. To do that, I have to find the matching pair in df1, such that df1.item equals df2.itemA and other_items.struct.[item] equals df2.itemB. If no such pair can be found, the rank should be 0.0.
The result should look like this:
+----------+-------------+-------------+
|itemA | itemB | rank |
+----------+-------------+-------------+
| 111 | 333 | 0.5 |
| 222 | 444 | 1.0 |
| 333 | 555 | 0.0 |
| 444 | 777 | 0.2 |
+----------+-------------+-------------+
How can I do this?
Answer 0 (score: 1)
This should do what you want. The trick is to explode other_items before joining:
import org.apache.spark.sql.functions.{coalesce, explode, lit}
import spark.implicits._ // assumes an existing SparkSession named spark

df2.as("df2").join(
    // flatten other_items so each (item, rank) struct gets its own row
    df1.select($"item", explode($"other_items").as("other_items")).as("df1"),
    $"df2.itemA" === $"df1.item" and $"df2.itemB" === $"df1.other_items.item",
    "left"
  )
  // unmatched rows carry a null rank; replace it with 0.0
  .select($"itemA", $"itemB", coalesce($"df1.other_items.rank", lit(0.0)).as("rank"))
  .show()
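The left join keeps every pair from df2 even when no exploded df1 row matches it, and coalesce then substitutes the literal 0.0 for the null rank those unmatched rows would otherwise carry.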
Answer 1 (score: 1)
You can achieve your requirement by defining a udf function and calling it after joining the two dataframes:
import scala.collection.mutable
import org.apache.spark.sql.functions._

// look up itemB in the items array and return the rank at the same index,
// or 0.0 when itemB is not present
def findRank = udf((items: mutable.WrappedArray[String], ranks: mutable.WrappedArray[Double], itemB: String) => {
  val index = items.indexOf(itemB)
  if (index != -1) ranks(index) else 0.0
})

df1.join(df2, df1("item") === df2("itemA"), "right")
  .select(df2("itemA"), df2("itemB"), findRank(df1("other_items.item"), df1("other_items.rank"), df2("itemB")).as("rank"))
  .show(false)
You should get the dataframe as:
+-----+-----+----+
|itemA|itemB|rank|
+-----+-----+----+
|111 |333 |0.5 |
|222 |444 |1.0 |
|333 |555 |0.0 |
|444 |777 |0.2 |
+-----+-----+----+
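Note that selecting other_items.item on an array of structs yields the array of the structs' item fields, so the udf receives two parallel arrays (the candidate items and their ranks) plus the itemB to look up, avoiding the explode step used in the first answer.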