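I use this SQL to create a session_id for a dataset. If a user has been inactive for at least 30 minutes (30 * 60 seconds), a new session_id is assigned. I am new to Spark SQL and am trying to replicate the same process with the Spark SQLContext, but I am running into some errors.
The session_id follows this naming convention: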
userid_1,
userid_2,
userid_3,...
SQL (date is in seconds):
CREATE TABLE tablename_with_session_id AS
SELECT * , userid || '_' || SUM(new_session) OVER (PARTITION BY userid ORDER BY date asc, new_session desc rows unbounded preceding) AS session_id
FROM
(SELECT *,
CASE
WHEN (date - LAG(date) OVER (PARTITION BY userid ORDER BY date) >= 30 * 60)
THEN 1
WHEN row_number() over (partition by userid order by date) = 1
THEN 1
ELSE 0
END as new_session
FROM
tablename
)
order by date;
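For readers who want to check the rule outside a database, the logic of the query above can be sketched with plain Scala collections. This is only an illustration under assumptions: the `Event` case class and the `SessionSketch` object are hypothetical names, and `date` is taken to be a timestamp in seconds as stated above.

```scala
object SessionSketch {
  // Hypothetical event type: `date` is a timestamp in seconds, as in the question.
  final case class Event(userid: String, date: Long)

  // A gap of at least this many seconds starts a new session.
  val SessionGapSeconds: Long = 30 * 60

  // Returns every event paired with its session_id ("<userid>_<n>"), ordered by date.
  def assignSessionIds(events: Seq[Event]): Seq[(Event, String)] =
    events
      .groupBy(_.userid)                          // PARTITION BY userid
      .toSeq
      .flatMap { case (userid, userEvents) =>
        val sorted = userEvents.sortBy(_.date)    // ORDER BY date
        // new_session flag: 1 for the first row or after a gap of >= 30 minutes
        val flags = sorted.indices.map { i =>
          if (i == 0 || sorted(i).date - sorted(i - 1).date >= SessionGapSeconds) 1 else 0
        }
        // Running total of the flags, like SUM(new_session) OVER (... rows unbounded preceding)
        val sessionNumbers = flags.scanLeft(0)(_ + _).tail
        sorted.zip(sessionNumbers).map { case (e, n) => (e, s"${userid}_$n") }
      }
      .sortBy(_._1.date)                          // final ORDER BY date
}
```

The running total is what turns the 0/1 flags into a per-user session counter, which is then appended to the userid.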
I tried to use the same SQL in Spark with Scala:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val tableSessionID = sqlContext.sql("""SELECT * , CONCAT(userid,'_',SUM(new_session)) OVER (PARTITION BY userid ORDER BY date asc, new_session desc rows unbounded preceding) AS new_session_id FROM
(SELECT *, CASE WHEN (date - LAG(date) OVER (PARTITION BY userid ORDER BY date) >= 30 * 60) THEN 1 WHEN row_number() over (partition by userid order by date) = 1 THEN 1 ELSE 0 END as new_session FROM clickstream) order by date""")
I got an error suggesting that the Spark SQL expression ..sum(new_session).. be wrapped in a window function.
I then tried using multiple DataFrames:
val temp1 = sqlContext.sql("SELECT *, CASE WHEN (date - LAG(date) OVER (PARTITION BY userid ORDER BY date) >= 30 * 60) THEN 1 WHEN row_number() over (partition by userid order by date) = 1 THEN 1 ELSE 0 END as new_session FROM clickstream")
temp1.registerTempTable("clickstream_temp1")
val temp2 = sqlContext.sql("SELECT * , SUM(new_session) OVER (PARTITION BY userid ORDER BY date asc, new_session desc rows unbounded preceding) AS s_id FROM clickstream_temp1")
temp2.registerTempTable("clickstream_temp2")
val temp3 = sqlContext.sql("SELECT * , CONCAT(userid,'_',s_id) OVER (PARTITION BY userid ORDER BY date asc, new_session desc rows unbounded preceding) AS new_session_id FROM clickstream_temp2")
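Only the last statement above, 'val temp3 = ...', returns an error: CONCAT(userid,'_',s_id) cannot be used as a window function.
What is the workaround? Are there any other options?
Thanks
Answer 0: (score: 2)
To use concat together with Spark window functions, you need a user-defined aggregate function (UDAF). You cannot apply the concat function directly with a window, because concat is a scalar function rather than an aggregate, so it cannot take an OVER clause itself.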
// Extend UserDefinedAggregateFunction to write a custom aggregate function.
// You can also specify constructor arguments, for instance
// CustomConcat(arg1: Int, arg2: String)
class CustomConcat() extends org.apache.spark.sql.expressions.UserDefinedAggregateFunction {
  import org.apache.spark.sql.types._
  import org.apache.spark.sql.expressions.MutableAggregationBuffer
  import org.apache.spark.sql.Row
  // Schema of the input: a single string column
  def inputSchema: StructType = StructType(Array(StructField("description", StringType)))
  // Schema of the intermediate buffer: the string concatenated so far
  def bufferSchema: StructType = StructType(Array(StructField("groupConcat", StringType)))
  // Type of the returned value
  def dataType: DataType = StringType
  // The same input always produces the same output
  def deterministic: Boolean = true
  // Called to initialize the buffer for each group.
  // The buffer starts with a single space, as in the original answer; use "" to avoid a leading space.
  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = " " }
  // Called for each input row of a group: append the row's value to the buffer
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getString(0) + input.getString(0)
  }
  // Merge two partial aggregates
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getString(0) + buffer2.getString(0)
  }
  // Called once all entries of a group have been processed
  def evaluate(buffer: Row): Any = buffer.getString(0)
}

// Assumption: the answer never defines windowspec. The definition below is a
// hypothetical example using the column names from the question.
import org.apache.spark.sql.expressions.Window
val windowspec = Window.partitionBy("userid").orderBy("date")

val newdescription = new CustomConcat
// The $"..." column syntax requires: import sqlContext.implicits._
val newdesc1 = newdescription($"description").over(windowspec)
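You can then use newdesc1 as the windowed aggregate expression that performs the concatenation. For more information, see: databricks udaf. I hope this answers your question.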
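As a possible alternative that is not part of the answer above: since only SUM needs the window, the OVER clause can be attached to SUM and the result passed to CONCAT as an ordinary argument. The sketch below just builds the query string; the table name `clickstream` and the columns come from the question, while the subquery alias `t`, the explicit `CAST`, and the object name `ConcatOutsideWindow` are assumptions added for illustration. It has not been run against any particular Spark version.

```scala
object ConcatOutsideWindow {
  // CONCAT wraps the windowed SUM; the OVER clause belongs to SUM, not to CONCAT.
  val sessionQuery: String =
    """SELECT *,
      |       CONCAT(userid, '_',
      |              CAST(SUM(new_session) OVER (PARTITION BY userid
      |                                          ORDER BY date ASC, new_session DESC
      |                                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
      |                   AS STRING)) AS new_session_id
      |FROM (SELECT *,
      |             CASE
      |               WHEN date - LAG(date) OVER (PARTITION BY userid ORDER BY date) >= 30 * 60 THEN 1
      |               WHEN ROW_NUMBER() OVER (PARTITION BY userid ORDER BY date) = 1 THEN 1
      |               ELSE 0
      |             END AS new_session
      |      FROM clickstream) t
      |ORDER BY date""".stripMargin
}
```

With the setup from the question, this string would be passed to Spark as sqlContext.sql(ConcatOutsideWindow.sessionQuery).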