How do I merge arrays with the same ID using Spark DataFrames?

Time: 2015-09-17 23:44:01

Tags: java sql apache-spark dataframe apache-spark-sql

My table looks like this:

+-------+--------------------+
|id     |                  c1|
+-------+--------------------+
|      1|ArrayBuffer(a,b)    |
|      1|ArrayBuffer(c  )    |
|      2|ArrayBuffer(d  )    |
|      2|ArrayBuffer(e,f)    |
|      2|ArrayBuffer(g  )    |
|      3|ArrayBuffer(h  )    |
+-------+--------------------+

I want the output to look like this:

+-------+--------------------+
|id     |                  c1|
+-------+--------------------+
|      1|ArrayBuffer(a,b,c)  |
|      2|ArrayBuffer(d,e,f,g)|
|      3|ArrayBuffer(h  )    |
+-------+--------------------+

This is what I have in mind:

SQLQuery = "SELECT table.id, join(table.c1) FROM table GROUP BY table.id"

 sqlContext.udf().register("join",
               new UDF1<ArrayBuffer, ArrayBuffer>() {
                   @Override
                   public ArrayBuffer call(ArrayBuffer idArray) {

                      // how do I join them?

                       return idArray;

                   }
              }, DataTypes.StringType);
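
To pin down the merge being asked for, here is a minimal sketch in plain Java, outside Spark. It concatenates the arrays of all rows that share an id, in the order the rows are read. The `ArrayMerge` class, the `Row` type, and `mergeById` are hypothetical names made up for this illustration; none of them is part of the Spark API. Inside Spark the same effect would need an aggregate over each group rather than a `UDF1`, because a `UDF1` only sees the value from one row at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ArrayMerge {

    // Hypothetical row type standing in for one (id, c1) row of the table.
    static final class Row {
        final int id;
        final List<String> c1;

        Row(int id, String... values) {
            this.id = id;
            this.c1 = Arrays.asList(values);
        }
    }

    // Concatenates the arrays of all rows that share an id, keeping the
    // order in which rows and elements were encountered.
    static Map<Integer, List<String>> mergeById(List<Row> rows) {
        Map<Integer, List<String>> merged = new LinkedHashMap<>();
        for (Row row : rows) {
            merged.computeIfAbsent(row.id, k -> new ArrayList<>()).addAll(row.c1);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Same rows as the example table in the question.
        List<Row> rows = Arrays.asList(
            new Row(1, "a", "b"),
            new Row(1, "c"),
            new Row(2, "d"),
            new Row(2, "e", "f"),
            new Row(2, "g"),
            new Row(3, "h"));
        System.out.println(mergeById(rows));
    }
}
```

Running `main` prints `{1=[a, b, c], 2=[d, e, f, g], 3=[h]}`, which matches the desired output table. A distributed engine does not necessarily keep the row order within a group, so the element order in a Spark result may differ unless it is enforced explicitly.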

0 answers:

No answers yet