Scala Window partitionBy: update one random record per group

Date: 2018-10-04 13:45:16

Tags: scala random window updates

I have the following data:

group_id    id  name
----        --  ----
G1          1   apple
G1          2   orange
G1          3   apple
G1          4   banana
G1          5   apple
G2          6   orange
G2          7   apple
G2          8   apple
G3          7   banana
G3          8   orange

I want to set random_pick to 1 for one randomly chosen record in each group and to 0 for all the others, like this:

group_id    id  name   random_pick
----        --  ----   -------------------
G1          1   apple       0
G1          2   orange      0
G1          3   apple       0
G1          4   banana      0
G1          5   apple       1
G2          6   orange      0
G2          7   apple       1
G2          8   apple       0
G3          7   banana      0
G3          8   orange      1

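My idea:

  1. Add a column with a default value of 0
  2. Use Window.partitionBy("group_id"), get the count for each group, draw a random number between 1 and that count, and set the matching record to 1

But how do I do this in Scala? :(

Thanks!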

2 answers:

Answer 0: (score: 1)

How about...

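// Assumes myData is a List[MyRow]; group ids are strings such as "G1".
case class MyRow(groupId: String, id: Int, name: String, randomPick: Boolean = false)

val randomPicks = myData.groupBy(_.groupId).toList.flatMap {
  case (_, rows) =>
    // groupBy never yields an empty group, so the head/tail split is safe.
    val h :: t = scala.util.Random.shuffle(rows)
    // Flag the first row of the shuffled group; rows come back in shuffled order.
    h.copy(randomPick = true) :: t
}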

Answer 1: (score: 0)

A more verbose version of @TerryDactyl's answer:

case class Tup(groupId: String,
               id: Int,
               name: String,
               randomPick: Boolean = false)

val ts = Seq(
  Tup("G1", 1, "apple"),
  Tup("G1", 2, "orange"),
  Tup("G1", 3, "apple"),
  Tup("G1", 4, "banana"),
  Tup("G1", 5, "apple"),
  Tup("G2", 6, "orange"),
  Tup("G2", 7, "apple"),
  Tup("G2", 8, "apple"),
  Tup("G3", 7, "banana"),
  Tup("G3", 8, "orange")
)

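val grouped = ts.groupBy(_.groupId)
// For each group, draw one random index and flag the row at that position.
// The result holds one Seq[Tup] per group, with the original row order kept.
val withChosen = grouped.map { case (_, rows) =>
  val i = scala.util.Random.nextInt(rows.length) // uniform over 0 until rows.length
  rows.zipWithIndex.map { case (tup, idx) =>
    if (idx == i) tup.copy(randomPick = true)
    else tup
  }
}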
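
Neither answer above uses Spark, although the Window.partitionBy in the question suggests the data may live in a Spark DataFrame. If so, one possible sketch is to number the rows of each group in a random order and flag row 1. The object name RandomPick, the helper withRandomPick, the temporary column rn, and the local SparkSession settings are assumptions made for illustration; the column names group_id and random_pick come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, rand, row_number, when}

object RandomPick {
  // Hypothetical helper: adds a 0/1 random_pick column with exactly one 1 per group_id.
  def withRandomPick(df: DataFrame): DataFrame = {
    // Within each group, order the rows by a random value.
    val w = Window.partitionBy("group_id").orderBy(rand())
    df.withColumn("rn", row_number().over(w))
      // The row that lands first in the random order gets 1, all others 0.
      .withColumn("random_pick", when(col("rn") === 1, lit(1)).otherwise(lit(0)))
      .drop("rn")
  }

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only.
    val spark = SparkSession.builder().master("local[*]").appName("random-pick").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("G1", 1, "apple"), ("G1", 2, "orange"), ("G1", 3, "apple"),
      ("G1", 4, "banana"), ("G1", 5, "apple"),
      ("G2", 6, "orange"), ("G2", 7, "apple"), ("G2", 8, "apple"),
      ("G3", 7, "banana"), ("G3", 8, "orange")
    ).toDF("group_id", "id", "name")

    withRandomPick(df).orderBy("group_id", "id").show()
    spark.stop()
  }
}
```

This skips the explicit per-group count from the question's idea: ordering by rand() and keeping row number 1 already picks each row of a group with equal probability. The picks change on every run unless a seed is passed to rand.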