So I have a table from which I want to create events. My users are watching a video, which is defined as a list of sub_parts, and each sub_part downloads some bytes.
For example, Alice is watching a 5-minute video made of 15-second sub_parts: she watched the first three parts, then skipped to part 7 and played two more parts, but in the end she never finished the video.
So I want to recreate this trail of events for each user with Spark SQL (most likely with a UDF, but help me out here, I don't understand how to make it work):
+---+------------+-------------+-------------+
| | Name | Video_part | Bytes Dl |
+---+------------+-------------+-------------+
| 1 | Alice | 1 | 200 |
| 2 | Alice | 2 | 250 |
| 3 | Alice | 3 | 400 |
| 1 | Alice | 7 | 100 |
| 2 | Alice | 8 | 200 |
| 3 | Bob | 1 | 1000 |
| 1 | Bob | 32 | 500 |
| 2 | Bob | 33 | 400 |
| 3 | Bob | 34 | 330 |
| 1 | Bob | 15 | 800 |
| 2 | Bob | 16 | 400 |
+---+------------+-------------+-------------+
So what I want is to group by runs of consecutive integers in Video_part: such a run is a Play event, and when the consecutive run breaks, that is a Skip_in or Skip_out event; for each run of parts played I want the mean of the downloaded bytes:
+---+------------+-------------+-------------+-------------+-------------+
| | Name | Number_play | Event | Number_skips| Mean_BytesDL|
+---+------------+-------------+-------------+-------------+-------------+
| 1 | Alice      | 3           | Play        | 0           | 283.3       |
| 2 | Alice | 0 | Skip_in | 4 | 0 |
| 3 | Alice | 2 | Play | 0 | 150 |
| 1 | Bob | 1 | Play | 0 | 1000 |
| 2 | Bob | 0 | Skip_in | 31 | 0 |
| 3 | Bob | 3 | Play | 0 | 410 |
| 2 | Bob | 0 | Skip_out | 19 | 0 |
| 3 | Bob | 2 | Play | 0 | 600 |
+---+------------+-------------+-------------+-------------+-------------+
The problem is that I can do this in Python or Scala with loops, maps and foreach over sub-lists or sub-pandas dfs, but running it over 1 TB of data takes forever, even on a cluster of nodes.
So I am wondering whether there is a way to do this in Spark SQL; I have looked a little at UDFs with groupBy, flatMap or agg, but I am having trouble because all of this is brand new to me. Hopefully you can help me somehow!
I know this is very specific, but maybe someone can help me out.
Thanks in advance, and have a nice day!
Answer 0 (score: 2)
Using a UDF gives you row-by-row computation over the columns you pass to it, and it is hard to satisfy your criteria that way.
I would suggest you use Window functions instead, where you can define the partitioning, the ordering and even the frame type:
PARTITION BY ... ORDER BY ... frame_type BETWEEN start AND end
The databricks docs and Mastering Apache Spark 2 should be enough to get started.
What I can suggest further is a first stage that computes Mean_BytesDL, where you could use:
Window.partitionBy(col("name")).orderBy(col("Video_part").asc).rowsBetween(<choose rows so that each frame would contain all the consecutive Video_part played>)
You can do the same for the other columns and drop all unnecessary rows; a rough illustration follows.
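Purely to illustrate the mechanics (not the consecutive-run frame itself, which is the hard part), here is a minimal sketch with a plain running frame per user, assuming the question's column names:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

// Sketch only: a running average per user, ordered by part. Restricting the
// frame to exactly one consecutive run is where it gets painful.
val win = Window
  .partitionBy(col("Name"))
  .orderBy(col("Video_part").asc)
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val withMean = df.withColumn("Mean_BytesDL", avg(col("Bytes Dl")).over(win))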
Handling a custom frame_type is not impossible, but it would certainly be a nightmare.
In the meantime I worked out a solution for you using a UDAF; before applying it, make sure there is an additional column that identifies the latest download of each user:
+---+-----+----------+--------+------+
|sn |Name |Video_part|Bytes D1|latest|
+---+-----+----------+--------+------+
|1 |Alice|1 |200 | |
|2 |Alice|2 |250 | |
|3 |Alice|3 |400 | |
|1 |Alice|7 |100 | |
|2 |Alice|8 |200 |latest|
|3 |Bob |1 |1000 | |
|1 |Bob |32 |500 | |
|2 |Bob |33 |400 | |
|3 |Bob |34 |330 | |
|1 |Bob |15 |800 | |
|2 |Bob |16 |400 |latest|
+---+-----+----------+--------+------+
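How that latest flag is produced is not shown here; a minimal sketch, assuming a hypothetical ordering column ts that captures the original event order (Spark rows carry no inherent order):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number, when}

// Hypothetical: "ts" stands in for whatever column records the original order.
val latestWin = Window.partitionBy(col("Name")).orderBy(col("ts").desc)
val withLatest = df
  .withColumn("rn", row_number().over(latestWin))
  .withColumn("latest", when(col("rn") === 1, lit("latest")).otherwise(lit("")))
  .drop("rn")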
After that, create the UDAF as below:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Note: the running state lives in instance variables rather than in the
// aggregation buffer, so this assumes each user's rows arrive in their original
// order within a single partition, and that each user starts with a Play event.
// It is a demonstration, not a production-grade aggregate.
private class MovieAggregateFunction(inputSourceSchema: StructType) extends UserDefinedAggregateFunction {
  var previousPlay: Int = _
  var previousEvent: String = _
  var playCount: Int = _
  var skipCount: Int = _
  var sum: Double = _
  var finalString: String = _
  var first: Boolean = _

  def inputSchema: StructType = inputSourceSchema
  def bufferSchema: StructType = new StructType().add("finalOutput", StringType)
  def dataType: DataType = StringType
  def deterministic: Boolean = false

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    previousPlay = 0
    previousEvent = "Play"
    playCount = 0
    skipCount = 0
    sum = 0.0
    finalString = ""
    first = true
    buffer.update(0, "")
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val name = input.getString(1)
    val vPart = input.getInt(2)
    val eventType = getEventType(previousPlay, vPart)
    val dPart = input.getInt(3).toDouble
    val latest = input.getString(4)
    // Same event as before and still playing: extend the current run.
    if (previousEvent.equalsIgnoreCase(eventType) && eventType.equalsIgnoreCase("Play")) {
      playCount += 1
      sum += dPart
    }
    // Event changed: emit the finished Play run followed by the skip event.
    if (!previousEvent.equalsIgnoreCase(eventType)) {
      if (first) {
        finalString = name + "::" + playCount + "::" + previousEvent + "::" + "0" + "::" + sum / playCount + "&&" +
          name + "::" + "0" + "::" + eventType + "::" + skipCount + "::" + "0"
      } else {
        finalString = finalString + "&&" + name + "::" + playCount + "::" + previousEvent + "::" + "0" + "::" + sum / playCount +
          "&&" + name + "::" + "0" + "::" + eventType + "::" + skipCount + "::" + "0"
      }
      // The row that triggered the skip starts the next Play run.
      playCount = 1
      sum = 0
      sum += dPart
      previousEvent = "Play"
      first = false
    }
    // Last row of this user: flush the final run.
    if (latest.equalsIgnoreCase("latest")) {
      finalString = finalString + "&&" + name + "::" + playCount + "::" + previousEvent + "::" + skipCount + "::" + sum / playCount
    }
    previousPlay = vPart
    buffer.update(0, finalString)
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1.update(0, buffer1.getString(0) + buffer2.getString(0))
  }

  def evaluate(buffer: Row): Any = {
    buffer.getString(0)
  }

  // Classify the transition between two consecutive rows and record the skip size.
  def getEventType(firstPlay: Int, secondPlay: Int): String = {
    if (firstPlay < secondPlay && secondPlay - firstPlay == 1) {
      skipCount = 0
      "Play"
    } else if (firstPlay < secondPlay && secondPlay - firstPlay > 1) {
      skipCount = secondPlay - firstPlay
      "Skip_in"
    } else if (firstPlay > secondPlay) {
      skipCount = firstPlay - secondPlay
      "Skip_out"
    } else {
      ""
    }
  }
}
Then instantiate the UDAF with the inputSchema and apply the aggregation function:
val udaf = new MovieAggregateFunction(df.schema)
df = df.groupBy("Name").agg(udaf(col("sn"), col("Name"), col("Video_part"), col("Bytes D1"), col("latest")).as("aggOut"))
The output so far is:
+-----+------------------------------------------------------------------------------------------------------------------------+
|Name |aggOut |
+-----+------------------------------------------------------------------------------------------------------------------------+
|Bob |Bob::1::Play::0::1000.0&&Bob::0::Skip_in::31::0&&Bob::3::Play::0::410.0&&Bob::0::Skip_out::19::0&&Bob::2::Play::0::600.0|
|Alice|Alice::3::Play::0::283.3333333333333&&Alice::0::Skip_in::4::0&&Alice::2::Play::0::150.0 |
+-----+------------------------------------------------------------------------------------------------------------------------+
We already have the required output. Now take the aggOut column into a separate dataFrame, convert it to an rdd, split it, and convert it back to a dataFrame, as below:
import scala.collection.mutable

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

// One record per "&&"-separated segment.
val lineRdd = df.rdd.flatMap(row => row(1).toString.split("&&").toList)
// One column per "::"-separated field.
val valueRdd = lineRdd.map(line => {
  val list = mutable.MutableList[String]()
  for (value <- line.split("::")) {
    list += value
  }
  Row.fromSeq(list)
})
val outputFields = Vector("Name", "Number_play", "Event", "Number_skips", "Mean_bytesDL")
val schema = StructType(outputFields.map(field => StructField(field, DataTypes.StringType, true)))
df = sqlContext.createDataFrame(valueRdd, schema)
df.show(false)
The final output is:
+-----+-----------+--------+------------+-----------------+
|Name |Number_play|Event |Number_skips|Mean_bytesDL |
+-----+-----------+--------+------------+-----------------+
|Bob |1 |Play |0 |1000.0 |
|Bob |0 |Skip_in |31 |0 |
|Bob |3 |Play |0 |410.0 |
|Bob |0 |Skip_out|19 |0 |
|Bob |2 |Play |0 |600.0 |
|Alice|3 |Play |0 |283.3333333333333|
|Alice|0 |Skip_in |4 |0 |
|Alice|2 |Play |0 |150.0 |
+-----+-----------+--------+------------+-----------------+
Note: the final dataTypes are all String; you can change them as needed.
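For instance, a brief sketch of casting the numeric columns back, using the column names from the output above:

import org.apache.spark.sql.functions.col

// Cast the stringly-typed columns to their natural types.
val typed = df
  .withColumn("Number_play", col("Number_play").cast("int"))
  .withColumn("Number_skips", col("Number_skips").cast("int"))
  .withColumn("Mean_bytesDL", col("Mean_bytesDL").cast("double"))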
Answer 1 (score: 2)
I have a much simpler approach. Let me explain it in words first. When an ordered set of rows holds consecutive integers, the following is true: for each row, the difference between the row number and the value in that row stays constant.
So, with a window function over Name ordered by Video_part, add an extra column, say "diff", containing Video_part - row_number(). All rows with consecutive values will then share the same value in this column.
Then you can simply groupBy("Name", "diff") and you have the groups you wanted.
Right? Here is a simple example: picture an ordered list of numbers in a column "value", with its corresponding added column of (value - row index), "diff":
+----+------+-----+
|row |value |diff |
+----+------+-----+
|0 |2 |2 |
|1 |3 |2 |
|2 |4 |2 |
|3 |7 |4 |
|4 |8 |4 |
|5 |23 |18 |
|6 |24 |18 |
+----+------+-----+
So grouping by diff gives you exactly the groups of rows with consecutive values.
Applying this requires something that can hand out a row number within an ordered list of rows, and windows can do that. Now to the code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// The $-syntax assumes spark.implicits._ (or sqlContext.implicits._) is in scope.
val win = Window.partitionBy("Name").orderBy("Video_part")
df.withColumn("diff", $"Video_part" - row_number().over(win))
Then group by "diff" and "Name" and aggregate however you need; a sketch of that last step follows.
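A minimal sketch of the aggregation, assuming the question's column names and the win and df from above (this only summarizes the Play runs; the Skip_in/Skip_out rows would still have to be derived from the gaps between runs):

import org.apache.spark.sql.functions.{avg, count, min, row_number}

// Sketch only: one output row per consecutive run of parts.
val runs = df
  .withColumn("diff", $"Video_part" - row_number().over(win))
  .groupBy($"Name", $"diff")
  .agg(
    count("*").as("Number_play"),
    avg($"Bytes Dl").as("Mean_BytesDL"),
    min($"Video_part").as("first_part")) // keeps the runs in playback order
  .orderBy($"Name", $"first_part")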
Answer 2 (score: 0)
It is no surprise it is slow if you do it the way you describe (Python with loops, map and foreach). I would do it with numpy, using its array logic. I will solve a simplified version of your problem; have a look and see whether it helps you. I am fairly sure it will, because I had exactly this simplified version of your problem myself.
So: I have a list of integers, sorted in my case, and I want to turn it into a series of non-contiguous ranges.
What I do is build two arrays, one dropping the first element and one dropping the last, and subtract one from the other. For readability I also subtract the expected difference of 1 between successive values, so the result can be compared against 0. I zip this index together with the two stripped arrays (using dstack for efficiency), and I think the rest is fairly self-explanatory.
# orig is your original array of integers; in my case, it is sorted.
import numpy as np

i = np.array(orig)
# Gap between successive values, minus the expected step of 1,
# so that consecutive values yield 0.
index = i[1:] - i[:-1] - 1
# Each row: (gap, value before the gap, value after the gap).
zipped = np.dstack((index, i[:-1], i[1:]))[0]
np_todo = zipped[index != 0, :]  # selecting using a boolean array
todo = list(np_todo)  # needed to use .pop()
todo.reverse()  # not necessary, you can work from the top if you prefer
ranges = []      # the result we will build
singletons = []  # not worth being called a range
bottom = i[0]    # bottom of the first range
while todo:
    _, top, next_bottom = todo.pop()
    if top != bottom:
        ranges.append((bottom, top))
    else:
        singletons.append(top)
    bottom = next_bottom
# Close the final run, which has no gap after it.
if bottom != i[-1]:
    ranges.append((bottom, i[-1]))
else:
    singletons.append(bottom)
print(singletons)
print(ranges)
In my own case I then discard all ranges shorter than 3, and I am quite curious to hear whether this turns out to be fast enough for your needs.