Combining MapType values when doing groupBy on a dataframe

Time: 2016-10-17 05:14:09

Tags: scala apache-spark

I have a dataframe with these 3 columns -> userId, date, generation


I want to group these values by userId and date, but the problem is that the third column contains MapType values, and the requirement is to combine all of the map values into a single column. The data looks like this ->

+-------+--------+----------------------------------------------------------------------------+
|userId |   date |generation                                                                  |
+-------+--------+----------------------------------------------------------------------------+
|1      |20160926|Map("screen_WiFi" -> 15.127, "upload_WiFi" -> 0.603, "total_WiFi" -> 19.551)|
|1      |20160926|Map("screen_2g" -> 0.573, "upload_2g" -> 0.466, "total_2g" -> 1.419)        |
|1      |20160926|Map("screen_3g" -> 10.084, "upload_3g" -> 80.515, "total_3g" -> 175.435)    |
+-------+--------+----------------------------------------------------------------------------+

Is there any way to solve this, or any possible workaround?

1 Answer:

Answer 0: (score: 2)

You can create a naive user-defined aggregate function (UDAF) that combines maps, and then use it as the aggregation function. Since you haven't defined how two values for the same key should be combined, I'll assume keys are unique, i.e. for any given userId and date, no key appears in two different records:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

/***
  * UDAF combining maps, overriding any duplicate key with the "latest" value
  * @param keyType DataType of the map key
  * @param valueType DataType of the map value
  * @tparam K key type
  * @tparam V value type
  */
class CombineMaps[K, V](keyType: DataType, valueType: DataType) extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = new StructType().add("map", dataType)
  override def bufferSchema: StructType = inputSchema
  override def dataType: DataType = MapType(keyType, valueType)
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = buffer.update(0, Map[K, V]())

  // naive implementation - assuming keys won't repeat; otherwise the later value for a key overrides the earlier one
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val before = buffer.getAs[Map[K, V]](0)
    val toAdd = input.getAs[Map[K, V]](0)
    val result = before ++ toAdd
    buffer.update(0, result)
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = update(buffer1, buffer2)

  override def evaluate(buffer: Row): Any = buffer.getAs[Map[K, V]](0)
}

// instantiate a CombineMaps with the relevant types:
val combineMaps = new CombineMaps[String, Double](StringType, DoubleType)

// groupBy and aggregate
val result = input.groupBy("userId", "date").agg(combineMaps(col("generation")) as "generation")
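
As a rough usage sketch (not part of the original answer), assuming a SparkSession named `spark` is in scope and using the three example rows from the question, the aggregation collapses them into a single row whose map is the union of the three input maps:

```scala
// Hypothetical end-to-end sketch, assuming `spark` (a SparkSession) is available
// and the CombineMaps class above has been compiled.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StringType}
import spark.implicits._

val input = Seq(
  (1, 20160926, Map("screen_WiFi" -> 15.127, "upload_WiFi" -> 0.603, "total_WiFi" -> 19.551)),
  (1, 20160926, Map("screen_2g" -> 0.573, "upload_2g" -> 0.466, "total_2g" -> 1.419)),
  (1, 20160926, Map("screen_3g" -> 10.084, "upload_3g" -> 80.515, "total_3g" -> 175.435))
).toDF("userId", "date", "generation")

val combineMaps = new CombineMaps[String, Double](StringType, DoubleType)

val result = input.groupBy("userId", "date")
  .agg(combineMaps(col("generation")) as "generation")

// result should contain a single row for (1, 20160926) whose generation map
// holds all nine key/value pairs from the three input rows.
result.show(truncate = false)
```

Note that because `update` merges with Scala's `Map ++`, any key that does appear in more than one row is silently resolved in favor of whichever row is processed last, so the result for duplicate keys is not deterministic across runs.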