I have a subtle Spark problem that I just can't wrap my head around.
We have two RDDs (coming from Cassandra). RDD1 contains Actions and RDD2 contains Historic data. Both have an id they can be matched/joined on. But the problem is that the two tables have an N:N relationship: Actions contains multiple rows with the same id, and so does Historic. Here is some example data from both tables.
Actions (time is actually a timestamp)
id | time | valueX
1 | 12:05 | 500
1 | 12:30 | 500
2 | 12:30 | 125
Historic (set_at is actually a timestamp)
id | set_at| valueY
1 | 11:00 | 400
1 | 12:15 | 450
2 | 12:20 | 50
2 | 12:25 | 75
How can we join these two tables so that we get a result like this:
1 | 100 # 500 - 400 for Actions#1 at time 12:05, because Historic was at 400 at that time
1 | 50  # 500 - 450 for Actions#2 at time 12:30, because Historic was at 450 at that time
2 | 50  # 125 - 75  for Actions#3 at time 12:30, because Historic was at 75 at that time
I can't come up with a good solution that feels right without doing a lot of iterating over a large dataset. The only idea I keep coming back to is building ranges out of the Historic set and then somehow checking which range, e.g. (11:00 - 12:15), each Action falls into before doing the calculation. But that seems slow to me. Is there a more efficient way? This kind of problem seems like it should be common, but I haven't found any hints yet. How would you solve this in Spark?
My current attempt so far (half of the code is done):
case class Historic(id: String, set_at: Long, valueY: Int)

val historicRDD = sc.cassandraTable[Historic](...)

historicRDD
  .map( row => ( row.id, row ) )
  .reduceByKey(...)
  // transforming to another case class, which results in something like this; code not finished yet
  // (List((Range(0, 12:25), 400), (Range(12:25, NOW), 450)))
  // From here we could join with Actions
  // and then some .filter maybe to select the right tuple from the list
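The range idea can be sketched without Spark: it boils down to an as-of lookup, where for each Action we take the latest Historic row whose set_at is not after the Action's time. A hypothetical local sketch (class and field names are assumed, times converted to seconds):

```scala
case class Action(id: String, time: Long, valueX: Int)
case class Hist(id: String, setAt: Long, valueY: Int)

def asOfJoin(actions: Seq[Action], historic: Seq[Hist]): Seq[(String, Int)] = {
  // Pre-sort the historic rows per id once, instead of scanning ranges repeatedly
  val byId: Map[String, Seq[Hist]] =
    historic.groupBy(_.id).map { case (id, hs) => id -> hs.sortBy(_.setAt) }
  for {
    a  <- actions
    hs <- byId.get(a.id).toSeq
    h  <- hs.takeWhile(_.setAt <= a.time).lastOption.toSeq // latest entry <= time
  } yield (a.id, a.valueX - h.valueY)
}

val actions = Seq(Action("1", 43500, 500), Action("1", 45000, 500), Action("2", 45000, 125))
val hist = Seq(Hist("1", 39600, 400), Hist("1", 44100, 450),
               Hist("2", 44400, 50),  Hist("2", 44700, 75))
println(asOfJoin(actions, hist)) // List((1,100), (1,50), (2,50))
```

This confirms the expected output on the sample data; the open question is how to express the same lookup efficiently over RDDs.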
Answer 0 (score: 4)
This is an interesting problem. I also spent some time figuring out an approach. This is what I came up with:
Given Action(id, time, x) and Historic(id, time, y), in Spark:
val actionById  = actions.keyBy(_.id)
val historyById = historic.keyBy(_.id)
val actionByHistory = actionById.join(historyById)

// keep only historic entries that happened before the action
val filteredActionByIdTime = actionByHistory.collect {
  case (k, (action, historic)) if action.time > historic.time =>
    ((action.id, action.time), (action, historic))
}

// per (id, time), keep the latest matching historic entry
val topHistoricByAction = filteredActionByIdTime.reduceByKey {
  case ((a1: Action, h1: Historic), (a2: Action, h2: Historic)) =>
    (a1, if (h1.time > h2.time) h1 else h2)
}

// we are done, let's produce a report now
val report = topHistoricByAction.map {
  case ((id, time), (action, historic)) => (id, time, action.x - historic.y)
}
With the data provided above, the report looks like this:
report.collect
Array[(Int, Long, Int)] = Array((1,43500,100), (1,45000,50), (2,45000,50))
(I converted the times to seconds to have simple timestamps.)
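The same join → filter → reduce pipeline can be sanity-checked locally, with plain collections standing in for the RDD operations (a sketch only, not the Spark code itself):

```scala
case class Action(id: Int, time: Long, x: Int)
case class Historic(id: Int, time: Long, y: Int)

val actions  = Seq(Action(1, 43500L, 500), Action(1, 45000L, 500), Action(2, 45000L, 125))
val historic = Seq(Historic(1, 39600L, 400), Historic(1, 44100L, 450),
                   Historic(2, 44400L, 50),  Historic(2, 44700L, 75))

val report =
  (for {
    a <- actions
    h <- historic
    if a.id == h.id && a.time > h.time // join on id, keep only earlier history
  } yield ((a.id, a.time), (a, h)))
    .groupBy(_._1)                     // same grouping key as the reduceByKey
    .map { case (_, pairs) =>
      val (a, h) = pairs.map(_._2).maxBy(_._2.time) // keep the latest historic row
      (a.id, a.time, a.x - h.y)
    }
    .toSeq
    .sortBy(t => (t._1, t._2))

println(report)
```

Note that the join produces every (Action, Historic) pair per id before filtering; on large, skewed ids this intermediate blow-up is the cost of the approach.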
Answer 1 (score: 0)
After hours of thinking, trying, and failing, I came up with this solution. I am not sure if it is any good, but for lack of other options it is my solution.
First, we expand our case class Historic:
case class Historic(id: String, set_at: Long, valueY: Int) {

  // Scala doesn't seem to provide something with similar operations,
  // so we use Java's TreeMap; we'll need its lookups a few lines later
  val set_at_map = new java.util.TreeMap[Long, Int]()
  set_at_map.put(0, valueY)      // means: from the beginning of the epoch ...
  set_at_map.put(set_at, valueY) // ... up to the set_at date

  // This is the fun part. With getHistoricValue we can pass any timestamp
  // and get back the value of the key range that contains the passed date.
  // For more information look at this answer: http://stackoverflow.com/a/13400317/1209327
  def getHistoricValue(date: Long): Option[Int] = {
    var e = set_at_map.floorEntry(date)
    if (e != null && e.getValue == null) {
      e = set_at_map.lowerEntry(date)
    }
    if (e == null) None else Some(e.getValue)
  }
}
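A quick local check of the floorEntry lookup (plain JVM collections, no Spark required; times are seconds as in the answer above):

```scala
import java.util.TreeMap

val m = new TreeMap[Long, Int]()
m.put(0L, 400)     // valueY is 400 from the epoch ...
m.put(44100L, 450) // ... and updated to 450 at 12:15 (44100s)

assert(m.floorEntry(43500L).getValue == 400) // 12:05 still falls in the 400 range
assert(m.floorEntry(45000L).getValue == 450) // 12:30 falls after the 12:15 update
println("ok")
```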
The case class is ready; now let's put it into action:
val historicRDD = sc.cassandraTable[Historic](...)
  .map( row => ( row.id, row ) )
  .reduceByKey( (row1, row2) => {
    row1.set_at_map.put(row2.set_at, row2.valueY) // we merge the historic events per id
    row1
  })

// Now we load the Actions and key them by id, as we did with Historic
val actionsRDD = sc.cassandraTable[Actions](...)
  .map( row => ( row.id, row ) )

// Both RDDs now have the same key, so we can join them
val fin = actionsRDD.join(historicRDD)
  .map { case (id, (action, historic)) =>
    ( id,
      action.valueX - historic.getHistoricValue(action.time).get // valueY for that timestamp
    )
  }
I am new to Scala, so if this code can be improved anywhere, please let me know.
Answer 2 (score: 0)
I know this question has already been answered, but I want to add another solution that worked for me.
Your data:
Actions
id | time | valueX
1 | 12:05 | 500
1 | 12:30 | 500
2 | 12:30 | 125
Historic
id | set_at| valueY
1 | 11:00 | 400
1 | 12:15 | 450
2 | 12:20 | 50
2 | 12:25 | 75
Actions and Historic combined:
id | time  | valueX | record-type
1  | 12:05 | 500    | Action
1  | 12:30 | 500    | Action
2  | 12:30 | 125    | Action
1  | 11:00 | 400    | Historic
1  | 12:15 | 450    | Historic
2  | 12:20 | 50     | Historic
2  | 12:25 | 75     | Historic
Write a custom partitioner and use repartitionAndSortWithinPartitions to partition by id but sort by time.
Partition-1
1 | 11:00 | 400 | Historic
1 | 12:05 | 500 | Action
1 | 12:15 | 450 | Historic
1 | 12:30 | 500 | Action
Partition-2
2 | 12:20 | 50  | Historic
2 | 12:25 | 75  | Historic
2 | 12:30 | 125 | Action
Traverse the records of each partition.
If it is a Historic record, add it to a map, or update the map if it already contains that id. The map tracks the latest valueY per id within each partition.
If it is an Action record, get the valueY value from the map and subtract it from valueX.
With map M:
Partition-1 traversal in order
M = {1 -> 400}  // a new entry in map M
1 | 100         // M(1) = 400; 500 - 400
M = {1 -> 450}  // update M, because the key already exists
1 | 50          // 500 - M(1) = 500 - 450
Partition-2 traversal in order
M = {2 -> 50}   // a new entry in M
M = {2 -> 75}   // update M, because the key already exists
2 | 50          // M(2) = 75; 125 - 75
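The per-partition traversal above can be sketched locally in plain Scala (record types and field names are assumed, times in seconds; in Spark this logic would live inside mapPartitions):

```scala
sealed trait Rec { def id: Int; def time: Long }
case class Act(id: Int, time: Long, x: Int) extends Rec
case class Hist(id: Int, time: Long, y: Int) extends Rec

// records arrive pre-sorted by time within their id, as the partitioner guarantees
def traverse(partition: Seq[Rec]): Seq[(Int, Int)] = {
  val m = scala.collection.mutable.Map[Int, Int]() // the map M from the walkthrough
  partition.flatMap {
    case Hist(id, _, y) => m(id) = y; None                  // update M, emit nothing
    case Act(id, _, x)  => m.get(id).map(y => (id, x - y))  // emit valueX - valueY
  }
}

val p1 = Seq(Hist(1, 39600, 400), Act(1, 43500, 500), Hist(1, 44100, 450), Act(1, 45000, 500))
val p2 = Seq(Hist(2, 44400, 50), Hist(2, 44700, 75), Act(2, 45000, 125))
println(traverse(p1) ++ traverse(p2)) // List((1,100), (1,50), (2,50))
```

Since each id lives entirely in one partition, the map never needs to be shared across partitions.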
You could instead try to partition and sort by time, but then you would need to merge the partitions later, and that may add some complexity.
That said, I found this approach better suited to the many-to-many joins we usually get when joining on time ranges.