I'm looking for a way to join the following two Spark datasets:
# city_visits:
person_id  city       timestamp
-----------------------------------------------
1          Paris      2017-01-01 00:00:00
1          Amsterdam  2017-01-03 00:00:00
1          Brussels   2017-01-04 00:00:00
1          London     2017-01-06 00:00:00
2          Berlin     2017-01-01 00:00:00
2          Brussels   2017-01-02 00:00:00
2          Berlin     2017-01-06 00:00:00
2          Hamburg    2017-01-07 00:00:00
# ice_cream_events:
person_id  flavour     timestamp
-----------------------------------------------
1          Vanilla     2017-01-02 00:12:00
1          Chocolate   2017-01-05 00:18:00
2          Strawberry  2017-01-03 00:09:00
2          Caramel     2017-01-05 00:15:00
So, for each row in city_visits, the row in ice_cream_events with the same person_id and the next timestamp value should be joined, producing this output:
person_id  city       timestamp            ic_flavour  ic_timestamp
---------------------------------------------------------------------------
1          Paris      2017-01-01 00:00:00  Vanilla     2017-01-02 00:12:00
1          Amsterdam  2017-01-03 00:00:00  Chocolate   2017-01-05 00:18:00
1          Brussels   2017-01-04 00:00:00  Chocolate   2017-01-05 00:18:00
1          London     2017-01-06 00:00:00  null        null
2          Berlin     2017-01-01 00:00:00  Strawberry  2017-01-03 00:09:00
2          Brussels   2017-01-02 00:00:00  Strawberry  2017-01-03 00:09:00
2          Berlin     2017-01-06 00:00:00  null        null
2          Hamburg    2017-01-07 00:00:00  null        null
The closest solution I have so far is the following, but it clearly joins every qualifying row in ice_cream_events rather than just the first one:
val cv = city_visits.orderBy("person_id", "timestamp")
val ic = ice_cream_events.orderBy("person_id", "timestamp")
val result = cv.join(ic, ic("person_id") === cv("person_id")
                      && ic("timestamp") > cv("timestamp"))
Is there a (preferably efficient) way to specify that each row should be joined only to the first matching ice_cream_events row, rather than to all of them?
Answer 0: (score: 1)
Request: please include the sc.parallelize code in the question. It makes answering a lot easier.
val city_visits = sc.parallelize(Seq(
  (1, "Paris",     "2017-01-01 00:00:00"),
  (1, "Amsterdam", "2017-01-03 00:00:00"),
  (1, "Brussels",  "2017-01-04 00:00:00"),
  (1, "London",    "2017-01-06 00:00:00"),
  (2, "Berlin",    "2017-01-01 00:00:00"),
  (2, "Brussels",  "2017-01-02 00:00:00"),
  (2, "Berlin",    "2017-01-06 00:00:00"),
  (2, "Hamburg",   "2017-01-07 00:00:00")
)).toDF("person_id", "city", "timestamp")

val ice_cream_events = sc.parallelize(Seq(
  (1, "Vanilla",    "2017-01-02 00:12:00"),
  (1, "Chocolate",  "2017-01-05 00:18:00"),
  (2, "Strawberry", "2017-01-03 00:09:00"),
  (2, "Caramel",    "2017-01-05 00:15:00")
)).toDF("person_id", "flavour", "timestamp")
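Note that the timestamps are plain strings here. Lexicographic string comparison happens to order correctly for this zero-padded format, but if you prefer real timestamp semantics you can cast the columns first. A minimal sketch, assuming the two DataFrames above (the *_ts names are only illustrative):

import org.apache.spark.sql.functions.to_timestamp

// Replace the string column with a proper TimestampType column; the default
// "yyyy-MM-dd HH:mm:ss" pattern matches the sample data above.
val city_visits_ts = city_visits.withColumn("timestamp", to_timestamp($"timestamp"))
val ice_cream_events_ts = ice_cream_events.withColumn("timestamp", to_timestamp($"timestamp"))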
Following the suggestion in the comments, you can do the join first, which creates all the possible row combinations.
val joinedRes = city_visits.as("C").
  join(ice_cream_events.as("I")
    , joinType = "LEFT_OUTER"
    , joinExprs =
      $"C.person_id" === $"I.person_id" &&
      $"C.timestamp" < $"I.timestamp"
  ).select($"C.person_id", $"C.city", $"C.timestamp",
           $"I.flavour".as("ic_flavour"), $"I.timestamp".as("ic_timestamp"))

joinedRes.orderBy($"person_id", $"timestamp").show
Then pick the first record with a groupBy clause.
import org.apache.spark.sql.functions._

val firstMatchRes = joinedRes.
  groupBy($"person_id", $"city", $"timestamp").
  agg(first($"ic_flavour"), first($"ic_timestamp"))

firstMatchRes.orderBy($"person_id", $"timestamp").show
Now it gets trickier, at least it did for me: the join above blows the data up enormously while the join operation runs, and Spark has to wait for the join to finish before it can run the groupBy, which leads to memory problems.
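To see the blow-up concretely, one quick sanity check is to compare row counts before and after the join. A small sketch, assuming the joinedRes DataFrame built above:

// Each city visit can match several later ice-cream events for the same
// person, so joinedRes typically has many more rows than city_visits.
val visitRows  = city_visits.count()
val joinedRows = joinedRes.count()
println(s"city_visits: $visitRows rows, joinedRes: $joinedRows rows")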
Enter stateful joins: we maintain a piece of state on each executor, and each executor emits only one row, based on the local state held in a Bloom filter.
import org.apache.spark.sql.functions._

var bloomFilter = breeze.util.BloomFilter.optimallySized[String](city_visits.count(), falsePositiveRate = 0.0000001)

val isFirstOfItsName = udf((uniqueKey: String, joinExprs: Boolean) =>
  if (joinExprs) {
    // Only update the bloom filter if all the other expressions evaluate to true.
    // DataFrame evaluation order of the join clause is not guaranteed, so we have to enforce this here.
    val res = bloomFilter.contains(uniqueKey)
    bloomFilter += uniqueKey
    !res
  } else false)
val joinedRes = city_visits.as("C").
  join(ice_cream_events.as("I")
    , joinType = "LEFT_OUTER"
    , joinExprs = isFirstOfItsName(
        concat($"C.person_id", $"C.city", $"C.timestamp"), // Unique key identifying the first row of its kind.
        $"C.person_id" === $"I.person_id" && $"C.timestamp" < $"I.timestamp") // All the other join conditions go here.
  ).select($"C.person_id", $"C.city", $"C.timestamp",
           $"I.flavour".as("ic_flavour"), $"I.timestamp".as("ic_timestamp"))

joinedRes.orderBy($"person_id", $"timestamp").show
Finally, combine the results from the different executors.
val firstMatchRes = joinedRes.
  groupBy($"person_id", $"city", $"timestamp").
  agg(first($"ic_flavour"), first($"ic_timestamp"))
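If you want the output column names to match the ones in the question exactly, you can alias the aggregates, since agg(first($"ic_flavour")) otherwise produces a generated column name. A small variant of the step above, assuming the same joinedRes:

// Alias the aggregated columns so the result matches the requested schema
// (person_id, city, timestamp, ic_flavour, ic_timestamp).
val firstMatchRes = joinedRes.
  groupBy($"person_id", $"city", $"timestamp").
  agg(first($"ic_flavour").as("ic_flavour"), first($"ic_timestamp").as("ic_timestamp"))

firstMatchRes.orderBy($"person_id", $"timestamp").show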