I am working with a Spark DataFrame that looks like this:
user | 1 | 2 | 3 | 4 | ... | 53
-------------------------------
1 | 1 | 0 | 0 | 1 | ... | 1
2 | 0 | 1 | 1 | 1 | ... | 0
3 | 1 | 1 | 0 | 0 | ... | 1
.
.
.
n | 1 | 0 | 1 | 1 | ... | 0
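It has a column with the user ID, followed by one column per week of the year, each holding a boolean value that indicates whether the user was active during that week.
My goal is to reduce it to a table like this: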
user | active_start | active_end | duration
-------------------------------------------
1 | 1 | 1 | 1
1 | 4 | 4 | 1
1 | 53 | 53 | 1
2 | 2 | 4 | 3
3 | 1 | 2 | 2
3 | 53 | 53 | 1
.
.
.
n | 1 | 1 | 1
n | 3 | 4 | 2
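which contains the periods of continuous activity.
I'm somewhat at a loss as to how to manipulate the table and aggregate the values so that a new row is created whenever a gap is detected.
I tried using island/gap detection code to generate these groups, but could not implement a version that does not also detect and generate rows for the sub-islands of larger islands.
Any help would be greatly appreciated. Thanks!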
Answer 0: (score: 2)
Here is another suggestion, which also uses flatMap, but with a foldLeft inside to compute the intervals:
case class Interval(user: Int, active_start: Int, active_end: Int, duration: Int)

def computeIntervals(userId: Int, weeks: Seq[Int]): TraversableOnce[Interval] = {
  // First, we get the indexes where the value is 1
  val indexes: Seq[Int] = weeks.zipWithIndex.collect {
    case (value, index) if value == 1 => index
  }
  // Then, we find the "breaks" in the sequence (i.e. when the difference between indexes is > 1)
  val breaks: Seq[Int] = indexes.foldLeft((List[Int](), -1)) { (pair, currentValue) =>
    val (breaksBuffer: List[Int], lastValue: Int) = pair
    if ((currentValue - lastValue) > 1 && lastValue >= 0) (breaksBuffer :+ lastValue :+ currentValue, currentValue)
    else (breaksBuffer, currentValue)
  }._1
  // Then, we add the first and last indexes and re-organize in pairs.
  // A user with no active weeks has no indexes, so head/last would throw; return no pairs instead.
  val breakPairs: Iterator[Seq[Int]] =
    if (indexes.isEmpty) Iterator.empty
    else (indexes.head +: breaks :+ indexes.last).map(_ + 1).grouped(2)
  // Finally, we convert each pair to an interval and return
  breakPairs.map {
    case Seq(lower, upper) => Interval(userId, lower, upper, upper - lower + 1)
  }
}
Running it:
import org.apache.spark.sql.Row
import spark.implicits._ // must be in scope before calling toDF on a local Seq

val df = Seq(
  (1, 1, 0, 0, 1, 1),
  (2, 0, 1, 1, 1, 0),
  (3, 0, 0, 1, 0, 1),
  (4, 1, 1, 0, 0, 1)
).toDF

df.flatMap { row: Row =>
  val (userId, weeksAsSeq) = (row.toSeq.head.asInstanceOf[Int], row.toSeq.drop(1).map(_.asInstanceOf[Int]))
  computeIntervals(userId, weeksAsSeq)
}.show
+----+------------+----------+--------+
|user|active_start|active_end|duration|
+----+------------+----------+--------+
| 1| 1| 1| 1|
| 1| 4| 5| 2|
| 2| 2| 4| 3|
| 3| 3| 3| 1|
| 3| 5| 5| 1|
| 4| 1| 2| 2|
| 4| 5| 5| 1|
+----+------------+----------+--------+
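As a hedged alternative to the index-and-breaks approach above, the same intervals can be computed in a single foldLeft over the raw 0/1 flags, which never needs the first or last index and so handles an all-zero row without special casing. This is a minimal sketch in plain Scala (no Spark); the names `ActiveRuns`, `Run` and `activeRuns` are hypothetical and not part of the original answer.

```scala
object ActiveRuns {
  // One period of continuous activity, with 1-based inclusive week numbers.
  final case class Run(start: Int, end: Int) {
    def duration: Int = end - start + 1
  }

  // weeks holds one 0/1 flag per week, week 1 first.
  def activeRuns(weeks: Seq[Int]): List[Run] =
    weeks.zipWithIndex.foldLeft(List.empty[Run]) {
      // Active week i + 1 directly follows the most recent run: extend that run.
      case (Run(s, e) :: rest, (1, i)) if e == i => Run(s, i + 1) :: rest
      // Active week after a gap (or the very first active week): open a new run.
      case (acc, (1, i)) => Run(i + 1, i + 1) :: acc
      // Inactive week: nothing to record.
      case (acc, _) => acc
    }.reverse // runs were prepended, so restore chronological order
}
```

Inside the flatMap, each `Run` could then be mapped to an `Interval(userId, run.start, run.end, run.duration)`.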
Answer 1: (score: 0)
Just flatMap your df with a function that computes the metrics for each row.
Then give the new DF its column names.
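import scala.collection.mutable.ArrayBuffer
// The tuple encoder used by flatMap comes from spark.implicits._, which must be in scope

val newDf = yourDf
  .flatMap(row => {
    val userId = row.getInt(0)
    val arrayBuffer = ArrayBuffer[(Int, Int, Int, Int)]()
    var start = -1
    for (i <- 1 to 53) {
      val active = row.getInt(i)
      if (active == 1 && start == -1) {
        start = i
      }
      else if (active == 0 && start != -1) {
        // The run ended at the previous week, so it covers weeks start..(i - 1)
        val end = i - 1
        val duration = end - start + 1
        arrayBuffer.append((userId, start, end, duration))
        start = -1
      }
    }
    // A run that is still open after week 53 has to be emitted as well
    if (start != -1) {
      arrayBuffer.append((userId, start, 53, 53 - start + 1))
    }
    arrayBuffer
  })
  .toDF("user", "active_start", "active_end", "duration")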
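Since the question mentions island/gap detection, it may help to see the core of that technique in isolation: if the active weeks are numbered by rank, then week minus rank is constant within each consecutive run, so grouping on that difference yields exactly one group per maximal island and no sub-islands. The following is only a sketch on plain Scala collections, not a Spark job; `GapsAndIslands` and `islands` are hypothetical names introduced for illustration.

```scala
object GapsAndIslands {
  // weeks holds one 0/1 flag per week, week 1 first.
  // Returns one (start, end, duration) triple per maximal run of active weeks.
  def islands(weeks: Seq[Int]): Seq[(Int, Int, Int)] = {
    // 1-based numbers of the active weeks, in ascending order.
    val activeWeeks = weeks.zipWithIndex.collect { case (1, i) => i + 1 }

    activeWeeks.zipWithIndex
      // week - rank stays constant while weeks are consecutive and jumps at every gap.
      .groupBy { case (week, rank) => week - rank }
      .values
      .map { run =>
        val ws = run.map(_._1)
        (ws.min, ws.max, ws.size)
      }
      .toSeq
      .sortBy(_._1) // groupBy does not preserve order, so sort by start week
  }
}
```

In Spark SQL the same idea is usually expressed on long-format data (one row per user and week) by subtracting a `row_number()` computed over a window partitioned by user and ordered by week, then grouping by user and that difference.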