I have 2 input dataframes like this:
eligibleDs
+---------+--------------------+
| cid| eligibleUIds|
+---------+--------------------+
| 1234|offer3,offer1,offer2|
| 2345| offer1,offer3|
| 3456| offer2,offer3|
| 4567| offer2|
| 5678| null|
+---------+--------------------+
suggestedDs
+---------+--------------------+
| cid| suggestedUids|
+---------+--------------------+
| 1234|offer2,offer1,offer3|
| 2345|offer1,offer2,offer3|
| 3456|offer1,offer3,offer2|
| 4567|offer3,offer1,offer2|
| 5678|offer1,offer2,offer3|
+---------+--------------------+
I want the output dataframe to look like this:
outputDs
+---------+--------+
| cid| topUid|
+---------+--------+
|3456 |offer3 |
|5678 |null |
|4567 |offer2 |
|1234 |offer2 |
|2345 |offer1 |
+---------+--------+
The idea is: for each cid, pick the first uid in suggestedUids that also appears in eligibleUids. This is what I have been able to come up with so far:
case class Out(cid: Int, topUid: String)

val combinedDs = eligibleDs.join(suggestedDs, Seq("cid"), "left")
val outputDs = combinedDs.map(row => {
  val cid = row.getInt(0)
  val eligibleUids = row.getString(1)
  val suggestedUids = row.getString(2)
  val suggestedUidsList = suggestedUids.split(",")
  var topUid: String = null  // null when no eligible uid is found (e.g. cid 5678)
  import scala.util.control.Breaks._
  breakable {
    for (uid <- suggestedUidsList) {
      if (eligibleUids != null && eligibleUids.contains(uid)) {
        topUid = uid
        break
      }
    }
  }
  Out(cid, topUid)
})
This seems crazy; can someone tell me whether there is a better way to do this?
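Stripped of Spark, the per-row selection amounts to "find the first suggested uid that is also eligible". A minimal plain-Scala sketch of that logic, using sample values taken from the tables above (the helper name `topUid` is illustrative, not from the original post):

```scala
// Find the first suggested uid that appears in the eligible list.
// Note: contains does substring matching here, mirroring the original
// approach; uids like "offer10" vs "offer1" would need exact matching.
def topUid(eligibleUids: String, suggestedUids: String): Option[String] =
  Option(eligibleUids).flatMap { eligible =>          // handles null eligibleUids
    suggestedUids.split(",").find(eligible.contains(_))
  }

val r1 = topUid("offer3,offer1,offer2", "offer2,offer1,offer3") // Some(offer2)
val r2 = topUid(null, "offer1,offer2,offer3")                   // None
```

Wrapping the null check in `Option(...)` avoids the explicit `breakable`/`break` machinery entirely.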
Answer 0 (score: 0)
Use dropWhile to drop the non-matching items from the suggestedUids list, then use headOption to pick the first item of the remaining list. This is one way to generate outputDs:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

case class Out(cid: Int, topUid: String)

val outputDs = combinedDs.map {
  case Row(cid: Int, null, _) =>
    Out(cid, null)
  case Row(cid: Int, eligibleUids: String, suggestedUids: String) =>
    val topUid = suggestedUids.split(",").
      dropWhile(!eligibleUids.contains(_)).headOption match {
        case Some(uid) => uid
        case None => null
      }
    Out(cid, topUid)
}

outputDs.show
outputDs.show
// +----+------+
// | cid|topUid|
// +----+------+
// |1234|offer2|
// |2345|offer1|
// |3456|offer3|
// |4567|offer2|
// |5678| null|
// +----+------+
Note that combinedDs as described in the question is a DataFrame. If it is converted to a typed Dataset, case Row(...) should be replaced with a plain tuple pattern, i.e. case (...).
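As an aside, the dropWhile(...).headOption chain in the answer is equivalent to a single find, which some may read more directly. A small sketch on plain collections, with values borrowed from row cid=1234:

```scala
// dropWhile + headOption vs. find: both return the first suggested uid
// that appears in the eligible list.
val eligibleUids  = "offer3,offer1,offer2"
val suggestedUids = "offer2,offer1,offer3".split(",")

// Drop leading non-matches, then take the head of what remains (if any).
val viaDropWhile = suggestedUids.dropWhile(!eligibleUids.contains(_)).headOption
// Equivalent: scan for the first element satisfying the predicate.
val viaFind      = suggestedUids.find(eligibleUids.contains(_))
// Both yield Some("offer2") for this row.
```

Either form slots into the match expression in the answer unchanged.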