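I have a Cartesian RDD that lets me filter an RDD down to a specific time range, but I also need the minimum value in that RDD so that I can compute the time delta between each record and the earliest entry.
I have a case class that looks like this: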
case class auction(id: String, prodID: String, timestamp: Long)
I combine two RDDs, one containing the auctions of note and the other containing the auctions that occurred within that time period, like this:
val specificmessages = allauctions.cartesian(winningauction)
  .filter { case (x, y) =>
    x.timestamp > y.timestamp - 10 &&
    x.timestamp < y.timestamp + 10 &&
    x.prodID == y.prodID
  }
I would like to add a field to specificmessages that holds the difference between each record's timestamp and the minimum auction timestamp.
Answer 0 (score: 1)
You can use DataFrames like this:
import org.apache.spark.sql.{functions => f}
import org.apache.spark.sql.expressions.Window
// Assumes a SparkSession named spark is in scope; needed for toDF and $"..."
import spark.implicits._
// Convert RDDs to DFs
val allDF = allauctions.toDF
val winDF = winningauction.toDF("winId", "winProdId", "winTimestamp")
// Prepare join conditions
val prodCond = $"prodID" === $"winProdId"
val tsCond = f.abs($"timestamp" - $"winTimestamp") < 10
// Create window: one partition per allauctions record, ordered by winning timestamp
val w = Window
  .partitionBy($"id", $"prodID", $"timestamp")
  .orderBy($"winTimestamp")
// first(...) over the ordered window yields the smallest winTimestamp per record
val joined = allDF
  .join(winDF, prodCond && tsCond)
  .select($"*", f.first($"winTimestamp").over(w).alias("mintimestamp"))
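The snippet above stops at the minimum timestamp, while the question asks for the difference. A minimal sketch of that last step follows, assuming the `timestamp` and `mintimestamp` columns produced above. The object name `DeltaColumnSketch`, the column name `delta`, the local session, and the sample rows are all hypothetical and only there to make the example runnable.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.{functions => f}

object DeltaColumnSketch {
  // Adds a "delta" column: record timestamp minus the window minimum.
  // Expects the "timestamp" and "mintimestamp" columns produced above.
  def addDelta(joined: DataFrame): DataFrame =
    joined.withColumn("delta", f.col("timestamp") - f.col("mintimestamp"))

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, only for this illustration
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("delta-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows shaped like the relevant columns of `joined`
    val sample = Seq(
      ("a1", "p1", 105L, 100L, 100L),
      ("a1", "p1", 105L, 110L, 100L)
    ).toDF("id", "prodID", "timestamp", "winTimestamp", "mintimestamp")

    addDelta(sample).show()
    spark.stop()
  }
}
```

Because `mintimestamp` is taken from the winning auctions, the delta can be negative when a record's timestamp is earlier than that minimum.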
Using plain RDDs:
// Create PairRDDs keyed by product ID
val allPairs = allauctions.map(a => (a.prodID, a))
val winPairs = winningauction.map(a => (a.prodID, a))
allPairs
  .join(winPairs) // Join by prodID -> RDD[(prodID, (auction, auction))]
  // Keep pairs whose timestamps are less than 10 apart
  .filter { case (_, (x, y)) => (x.timestamp - y.timestamp).abs < 10 }
  .values // Drop key -> RDD[(auction, auction)]
  .groupByKey // Group by allauctions record -> RDD[(auction, Iterable[auction])]
  .flatMap { case (k, vals) =>
    val minTs = vals.map(_.timestamp).min // Min timestamp among matching winningauction records
    vals.map(v => (k, v, minTs))
  } // -> RDD[(auction, auction, Long)]
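The RDD pipeline above also ends at the minimum timestamp rather than the delta itself. As a rough illustration, the same join, filter, group, and min steps can be mirrored on plain Scala collections, which stand in for RDDs here so the example runs without Spark. The object name `RddDeltaSketch`, the `withDelta` helper, its `window` parameter, and the sample values are hypothetical.

```scala
object RddDeltaSketch {
  // Same shape as the case class in the question
  case class auction(id: String, prodID: String, timestamp: Long)

  // Mirrors join -> filter -> group -> min on local collections, then
  // appends the delta (record timestamp minus the minimum timestamp).
  def withDelta(
      all: Seq[auction],
      winners: Seq[auction],
      window: Long = 10L
  ): Seq[(auction, auction, Long, Long)] = {
    // Pair records with the same product whose timestamps are close enough
    val pairs = for {
      a <- all
      w <- winners
      if a.prodID == w.prodID && math.abs(a.timestamp - w.timestamp) < window
    } yield (a, w)

    // Group by the record from `all`, find the minimum winner timestamp,
    // and emit (record, winner, minTs, delta) for every pair in the group
    pairs.groupBy(_._1).toSeq.flatMap { case (a, group) =>
      val minTs = group.map(_._2.timestamp).min
      group.map { case (_, w) => (a, w, minTs, a.timestamp - minTs) }
    }
  }
}
```

In the RDD version, the equivalent change would be to emit `(k, v, minTs, k.timestamp - minTs)` from the final `flatMap`.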