Spark SQL: group by consecutive integer sequences

Asked: 2017-04-28 10:45:16

Tags: scala apache-spark apache-spark-sql

So I have a table from which I want to build events. My users are watching a video that is defined as a list of sub_parts, and each sub_part downloads some bytes.

For example, Alice is watching a 5-minute video cut into 15-second parts: she watches the first three parts, then skips ahead to part 7, plays two more parts, and in the end never finishes the video.

So I want to reconstruct this trail of events for each user with Spark SQL (most likely with a UDF, but please bear with me, I just can't see how to make it work):

+---+------------+-------------+-------------+
|   |   Name     | Video_part  | Bytes Dl    |
+---+------------+-------------+-------------+
| 1 | Alice      |       1     |      200    |
| 2 | Alice      |       2     |      250    |
| 3 | Alice      |       3     |      400    |
| 1 | Alice      |       7     |      100    |
| 2 | Alice      |       8     |      200    |
| 3 | Bob        |       1     |     1000    |
| 1 | Bob        |       32    |      500    |
| 2 | Bob        |       33    |      400    |
| 3 | Bob        |       34    |      330    |
| 1 | Bob        |       15    |      800    |
| 2 | Bob        |       16    |      400    |
+---+------------+-------------+-------------+

So what I want is to group by runs of consecutive integers in Video_part: each run is a Play event, and when the run breaks it becomes a Skip_in or Skip_out event. For every run of played parts I also want the mean of the downloaded bytes:

+---+------------+-------------+-------------+-------------+-------------+
|   |   Name     | Number_play |    Event    | Number_skips| Mean_BytesDL|
+---+------------+-------------+-------------+-------------+-------------+
| 1 | Alice      |       3     |     Play    |       0     |      283.3  |
| 2 | Alice      |       0     |    Skip_in  |       4     |      0      |
| 3 | Alice      |       2     |     Play    |       0     |      150    |
| 1 | Bob        |       1     |     Play    |       0     |      1000   |
| 2 | Bob        |       0     |    Skip_in  |       31    |      0      |
| 3 | Bob        |       3     |     Play    |       0     |      410    |
| 2 | Bob        |       0     |    Skip_out |       19    |      0      |
| 3 | Bob        |       2     |     Play    |       0     |      600    |
+---+------------+-------------+-------------+-------------+-------------+

The problem is that I can do this in Python or Scala with loops, map and foreach over sub-lists or sub-pandas DataFrames, but running it on 1 TB of data takes a very long time, even on a cluster of nodes.

So I would like to know whether there is a way to do this in Spark SQL. I have looked a little at UDFs with groupBy, flatMap or agg, but I am struggling because all of this is completely new to me; I hope you can help me somehow!

What I had in mind was:

  • Sort by Name
  • For each distinct Name:
  • Aggregate Video_part with a UDF -> this creates the three new columns, one of them being the parts played
  • Take the mean of Bytes DL over those played parts

I know this is very specific, but maybe someone can help me.

Thanks in advance, and have a nice day!

3 Answers:

Answer 0 (score: 2)

A UDF gives you row-by-row processing of the columns you pass to it, which makes it hard to meet your requirements that way.
I suggest you use window functions instead, where you can define the partitioning, the ordering and even the frame type:

PARTITION BY ... ORDER BY ... frame_type BETWEEN start AND end
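
For illustration, a minimal sketch of what such a window spec looks like in the DataFrame API (the frame bounds here are only an example; the column names are taken from the question):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

// Example frame: everything from the start of the partition up to the current row.
val win = Window
  .partitionBy(col("Name"))
  .orderBy(col("Video_part").asc)
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)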

The Databricks docs and Mastering Apache Spark 2 should be enough to get you started.
What I can suggest beyond that is a first stage computing Mean_BytesDL, where you could use:

Window.partitionBy(col("name")).orderBy(col("Video_part").asc).rowsBetween(<choose rows so that each frame would contian all the consecutive Video_part played>)

You can do the same for the other columns and then drop all the unnecessary rows.

Dealing with a custom frame_type is not impossible, but it would certainly be a nightmare.
In the meantime I have worked out a solution for you using a UDAF, but before that make sure there is an extra column identifying each user's latest download:

+---+-----+----------+--------+------+
|sn |Name |Video_part|Bytes D1|latest|
+---+-----+----------+--------+------+
|1  |Alice|1         |200     |      |
|2  |Alice|2         |250     |      |
|3  |Alice|3         |400     |      |
|1  |Alice|7         |100     |      |
|2  |Alice|8         |200     |latest|
|3  |Bob  |1         |1000    |      |
|1  |Bob  |32        |500     |      |
|2  |Bob  |33        |400     |      |
|3  |Bob  |34        |330     |      |
|1  |Bob  |15        |800     |      |
|2  |Bob  |16        |400     |latest|
+---+-----+----------+--------+------+
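
One way to get that flag is sketched below, assuming the original row order can be captured with a monotonically increasing id (column names follow the table above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, max, monotonically_increasing_id, when}

// Mark the last row of each user as "latest" and leave the column empty otherwise.
val withId = df.withColumn("rowId", monotonically_increasing_id())
val byUser = Window.partitionBy(col("Name"))
val flagged = withId
  .withColumn("latest",
    when(col("rowId") === max(col("rowId")).over(byUser), lit("latest")).otherwise(lit("")))
  .drop("rowId")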

After that, create a UDAF as below:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Walks one user's rows in arrival order and builds a "&&"-separated string of events,
// each event encoded as name::playCount::event::skipCount::meanBytes.
private class MovieAggregateFunction(inputSourceSchema : StructType) extends UserDefinedAggregateFunction {
  var previousPlay : Int = _
  var previousEvent : String = _
  var playCount : Int = _
  var skipCount : Int = _
  var sum : Double = _
  var finalString : String = _
  var first : Boolean = _

  def inputSchema: StructType = inputSourceSchema

  def bufferSchema: StructType = new StructType().add("finalOutput", StringType)

  def dataType: DataType = StringType

  def deterministic: Boolean = false

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    previousPlay = 0
    previousEvent = "Play"
    playCount = 0
    skipCount = 0
    sum = 0.0
    finalString = ""
    first = true
    buffer.update(0,"")
  }

  // Called once per input row; classifies the transition against the previous part
  // and appends a finished event to finalString whenever the event type changes.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val sn = input.getInt(0)
    val name = input.getString(1)
    val vPart = input.getInt(2)
    val eventType = getEventType(previousPlay, vPart)
    val dPart = input.getInt(3).toDouble
    val latest = input.getString(4)
    if(previousEvent.equalsIgnoreCase(eventType) && eventType.equalsIgnoreCase("Play")){
      playCount +=1
      sum += dPart
    }
    if(!previousEvent.equalsIgnoreCase(eventType)){
      if(first) {
        finalString = name + "::" + playCount + "::" + previousEvent + "::" + "0" + "::" + sum / playCount + "&&" +
          name + "::" + "0" + "::" + eventType + "::" + skipCount + "::" + "0"
      }
      else{
        finalString = finalString+"&&"+name + "::" + playCount + "::" + previousEvent + "::" + "0" + "::" + sum / playCount +
          "&&" + name + "::" + "0" + "::" + eventType + "::" + skipCount + "::" + "0"
      }
      playCount = 1
      sum = 0
      sum += dPart
      previousEvent = "Play"
      first = false
    }
    if(latest.equalsIgnoreCase("latest")){
      finalString = finalString + "&&" + name + "::" + playCount + "::" + previousEvent + "::" + skipCount + "::" + sum / playCount
    }
    previousPlay = vPart
    buffer.update(0, finalString)
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1.update(0, buffer1.getString(0) + buffer2.getString(0))
  }

  def evaluate(buffer: Row): Any = {
    buffer.getString(0)
  }

  // Classifies the step from the previous part to the current one and sets skipCount.
  def getEventType(firstPlay: Int, secondPlay: Int): String ={
    if(firstPlay < secondPlay && secondPlay - firstPlay == 1){
      skipCount = 0
      "Play"
    }
    else if(firstPlay < secondPlay && secondPlay-firstPlay > 1){
      skipCount = secondPlay - firstPlay
      "Skip_in"
    }
    else if(firstPlay > secondPlay){
      skipCount = firstPlay - secondPlay
      "Skip_out"
    }
    else
      ""
  }
}

Then instantiate the UDAF with the input schema and apply the aggregation:

import org.apache.spark.sql.functions.col

val udaf = new MovieAggregateFunction(df.schema)
df = df.groupBy("Name").agg(udaf(col("sn"), col("Name"), col("Video_part"), col("Bytes D1"), col("latest")).as("aggOut"))

The output up to this point is:

+-----+------------------------------------------------------------------------------------------------------------------------+
|Name |aggOut                                                                                                                  |
+-----+------------------------------------------------------------------------------------------------------------------------+
|Bob  |Bob::1::Play::0::1000.0&&Bob::0::Skip_in::31::0&&Bob::3::Play::0::410.0&&Bob::0::Skip_out::19::0&&Bob::2::Play::0::600.0|
|Alice|Alice::3::Play::0::283.3333333333333&&Alice::0::Skip_in::4::0&&Alice::2::Play::0::150.0                                 |
+-----+------------------------------------------------------------------------------------------------------------------------+

We already have the required output in there. Now turn the aggOut column into a separate DataFrame: convert it to an RDD, split the strings, and convert it back to a DataFrame as below:

import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

// Each "&&"-separated token of aggOut becomes one output row; "::" separates its fields.
val lineRdd = df.rdd.flatMap(row => row(1).toString.split("&&").toList)
val valueRdd = lineRdd.map(line => {
  val list = mutable.MutableList[String]()
  for(value <- line.split("::")){
    list += value
  }
  Row.fromSeq(list)
  })
val outputFields = Vector("Name", "Number_play", "Event", "Number_skips", "Mean_bytesDL")
val schema = StructType(outputFields.map(field => StructField(field, DataTypes.StringType, true)))
df = sqlContext.createDataFrame(valueRdd, schema)
df.show(false)

The final output is:

+-----+-----------+--------+------------+-----------------+
|Name |Number_play|Event   |Number_skips|Mean_bytesDL     |
+-----+-----------+--------+------------+-----------------+
|Bob  |1          |Play    |0           |1000.0           |
|Bob  |0          |Skip_in |31          |0                |
|Bob  |3          |Play    |0           |410.0            |
|Bob  |0          |Skip_out|19          |0                |
|Bob  |2          |Play    |0           |600.0            |
|Alice|3          |Play    |0           |283.3333333333333|
|Alice|0          |Skip_in |4           |0                |
|Alice|2          |Play    |0           |150.0            |
+-----+-----------+--------+------------+-----------------+

Note: the final dataTypes are all String; you can change them as required.
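
For example, a small sketch of how the numeric columns could be cast afterwards:

import org.apache.spark.sql.functions.col

// Cast the string columns produced above to numeric types.
val typed = df
  .withColumn("Number_play", col("Number_play").cast("int"))
  .withColumn("Number_skips", col("Number_skips").cast("int"))
  .withColumn("Mean_bytesDL", col("Mean_bytesDL").cast("double"))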

Answer 1 (score: 2)

I have a much simpler approach. Let me explain it in words first. If you have consecutive integers in a set of ordered rows, the following holds: for every row, the difference between the row number and the value in that row stays constant.

So, over a window partitioned by Name and ordered by Video_part, add an extra column, say "diff", containing Video_part - row_number(). All rows with consecutive values will then share the same value in this column.

Then you can simply groupBy("Name", "diff") and you have the groups you want. Right? Here is a simple example: imagine an ordered list of numbers in a column "value" and the corresponding added column "diff" holding (value - row index):

+----+------+-----+
|row |value |diff |
+----+------+-----+
|0   |2     |2    |
|1   |3     |2    |
|2   |4     |2    |
|3   |7     |4    |
|4   |8     |4    |
|5   |23    |18   |
|6   |24    |18   |
+----+------+-----+

So grouping by diff gives you the groups of rows with consecutive values.

Applying this needs something that can provide a row number over an ordered list of rows, and window functions can do exactly that. Now to the code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._ // for the $"..." column syntax (spark being the SparkSession)

val win = Window.partitionBy("Name").orderBy("Video_part")
df.withColumn("diff", $"Video_part" - row_number().over(win))

Then group by "diff" and "Name" and aggregate however you need.
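
Here is a sketch of that step (column names taken from the question; the Skip_in/Skip_out rows are not produced by this aggregation and would still have to be derived separately, for example by comparing the bounds of adjacent groups):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, count, row_number}

// Tag each row with its "island" id, then aggregate per island.
val win = Window.partitionBy("Name").orderBy("Video_part")
val plays = df
  .withColumn("diff", col("Video_part") - row_number().over(win))
  .groupBy("Name", "diff")
  .agg(count("*").as("Number_play"), avg(col("Bytes Dl")).as("Mean_BytesDL"))
  .drop("diff")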

Answer 2 (score: 0)

It is no surprise that it is slow if you do it the way you describe (Python, with loops, map and foreach). I would do it with numpy, using its array logic. I will solve a simplified version of your problem and let you see for yourself whether it helps; I am fairly sure it will, because mine was exactly a simplified version of your problem.

So I have a list of integers which, in my case, is sorted, and I want to turn it into a series of non-contiguous ranges.

What I do is build two lists, one with the first element removed and one with the last removed, and subtract one from the other. For readability I also subtract the expected step of 1 between a value and its successor, so the result can be compared against 0. I zip this difference with both stripped lists (using dstack for efficiency), and I think the rest is fairly self-explanatory.
# orig is your original array of integers, in my case, it's sorted.
import numpy as np
i = np.array(orig)
index = i[1:] - i[:-1] - 1
zipped = np.dstack((index, i[:-1], i[1:]))[0]
np_todo = zipped[index!=0,:]  # selecting using a boolean array
todo = list(np_todo)  # needed to use .pop()
todo.reverse()  # not necessary, you can work from the top if you prefer
ranges = []     # the result we will build
singletons = [] # not worth being called a range
bottom = i[0]   # bottom of first range
while todo:
    _, top, next_bottom = todo.pop()
    if top != bottom:
        ranges.append((bottom, top))
    else:
        singletons.append(top)
    bottom = next_bottom
# close the final range, which runs from the last skip to the end of the list
if i[-1] != bottom:
    ranges.append((bottom, i[-1]))
else:
    singletons.append(bottom)
print(singletons)
print(ranges)

In my own case, I then throw away all the ranges that are not at least 3 long, and I would be very curious to know whether this turns out to be fast enough for your needs.