UDAF to merge rows, sorted first, in a Spark DataSet / DataFrame

Asked: 2017-06-06 09:13:35

Tags: scala apache-spark apache-spark-2.0

Suppose we have a Dataset / DataFrame in Spark with three columns: ID, Word, and Timestamp.

I want to write a UDAF so that I can do something like this:

df.show()

ID | Word    | Timestamp
1  | I       | "2017-1-1 00:01"
1  | am      | "2017-1-1 00:02"
1  | Chris   | "2017-1-1 00:03"
2  | I       | "2017-1-1 00:01"
2  | am      | "2017-1-1 00:02"
2  | Jessica | "2017-1-1 00:03"

val df_merged = df.groupBy("ID")
  .sort("ID", "Timestamp")
  .agg(custom_agg("ID", "Word", "Timestamp")

df_merged.show

ID | Words          | StartTime        | EndTime          |
1  | "I am Chris"   | "2017-1-1 00:01" | "2017-1-1 00:03" |
2  | "I am Jessica" | "2017-1-1 00:01" | "2017-1-1 00:03" |

The question is how to make sure the Word column is merged in the correct order inside the UDAF.

2 Answers:

Answer 0 (score: 0)

Sorry, I don't use Scala; hopefully you can still follow the code.

A window function can do what you want:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Collect all Words of each ID into a list, ordered by Timestamp.
# The unbounded frame makes every row of an ID see the complete, ordered list.
df = df.withColumn('Words', f.collect_list(df['Word']).over(
    Window.partitionBy(df['ID'])
          .orderBy('Timestamp')
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))

Output:

+---+-------+-----------------+----------------+                                
| ID|   Word|        Timestamp|           Words|
+---+-------+-----------------+----------------+
|  1|      I|2017-1-1 00:01:00|  [I, am, Chris]|
|  1|     am|2017-1-1 00:02:00|  [I, am, Chris]|
|  1|  Chris|2017-1-1 00:03:00|  [I, am, Chris]|
|  2|      I|2017-1-1 00:01:00|[I, am, Jessica]|
|  2|     am|2017-1-1 00:02:00|[I, am, Jessica]|
|  2|Jessica|2017-1-1 00:03:00|[I, am, Jessica]|
+---+-------+-----------------+----------------+

Then groupBy the data above:

# Collapse each (ID, Words) group to a single row with the time range,
# then join the word list into one space-separated string.
df = df.groupBy(df['ID'], df['Words']).agg(
    f.min(df['Timestamp']).alias('StartTime'), f.max(df['Timestamp']).alias('EndTime'))
df = df.withColumn('Words', f.concat_ws(' ', df['Words']))

Output:

+---+------------+-----------------+-----------------+                          
| ID|       Words|        StartTime|          EndTime|
+---+------------+-----------------+-----------------+
|  1|  I am Chris|2017-1-1 00:01:00|2017-1-1 00:03:00|
|  2|I am Jessica|2017-1-1 00:01:00|2017-1-1 00:03:00|
+---+------------+-----------------+-----------------+
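
Since the question is tagged scala, a rough Scala equivalent of the window-plus-groupBy approach above might look like this. It is only a sketch: it assumes the df from the question and Spark 2.1+ for the Window.unboundedPreceding / unboundedFollowing constants.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Frame covering the whole partition, ordered by Timestamp,
// so collect_list sees the words of each ID in time order.
val w = Window
  .partitionBy("ID")
  .orderBy("Timestamp")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val dfMerged = df
  .withColumn("Words", collect_list(col("Word")).over(w))
  .groupBy("ID", "Words")
  .agg(min("Timestamp").as("StartTime"), max("Timestamp").as("EndTime"))
  .withColumn("Words", concat_ws(" ", col("Words")))

dfMerged.show(false)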

Answer 1 (score: 0)

Here is a solution using Spark 2's groupByKey (which works on an untyped Dataset). The advantage of groupByKey is that you have access to the whole group: you get an Iterator[Row] inside mapGroups.
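
The answer's original code block is not included above; the following is only a minimal sketch of what a groupByKey / mapGroups version could look like, assuming the df from the question, a SparkSession `spark` with `import spark.implicits._` in scope, an Int ID column, and string Word/Timestamp columns.

import org.apache.spark.sql.Row

// Group by ID, then process each group as a whole: sort its rows by
// Timestamp, join the words, and keep the first and last timestamps.
val merged = df
  .groupByKey((row: Row) => row.getAs[Int]("ID"))
  .mapGroups { (id: Int, rows: Iterator[Row]) =>
    val sorted = rows.toSeq.sortBy(_.getAs[String]("Timestamp"))
    val words  = sorted.map(_.getAs[String]("Word")).mkString(" ")
    (id, words,
      sorted.head.getAs[String]("Timestamp"),
      sorted.last.getAs[String]("Timestamp"))
  }
  .toDF("ID", "Words", "StartTime", "EndTime")

merged.show(false)

Because mapGroups hands you all rows of an ID at once, the ordering is done explicitly inside the function rather than relying on a window frame.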