Question

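I am using Spark and the Dataset API to build some analytical datasets. I have reached the part where I compute a number of variables, and it looks like this: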

CntDstCdrs1.groupByKey(x => (x.bs_recordid, x.bs_utcdate)).agg(
   count(when(($"bc_sub_org_id" === lit(500) && $"bc_utcdate" >= $"day_1" && $"bc_utcdate" <= $"bs_utcdate") , $"bc_phonenum")).as[Long].name("count_phone_1day"),
   count(when(($"bc_sub_org_id" === lit(500) && $"bc_utcdate" >= $"day_3" && $"bc_utcdate" <= $"bs_utcdate") , $"bc_phonenum")).as[Long].name("count_phone_3day_cust"),
   count(when(($"bc_sub_org_id" === lit(500) && $"bc_utcdate" >= $"day_5" && $"bc_utcdate" <= $"bs_utcdate") , $"bc_phonenum")).as[Long].name("count_phone_5day_cust"),
   count(when(($"bc_sub_org_id" === lit(500) && $"bc_utcdate" >= $"day_7" && $"bc_utcdate" <= $"bs_utcdate") , $"bc_phonenum")).as[Long].name("count_phone_7day_cust")
  ).show()

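This code works fine, but when I try to add one more count for the variable "count_phone_30day", I get an error: "method overloaded...". Does this mean the agg method here accepts at most 4 Column expressions? In any case, if this approach is not the best practice for computing a large number of variables, what would be? I have variables that are counts, distinct counts, sums, and so on.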

KR, Stefan

Answer 1

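Dataset.groupByKey returns a KeyValueGroupedDataset.

That class has no varargs version of agg. Its overloads take between one and four typed columns, so you can pass at most 4 columns as arguments.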
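One way around the limit, sketched here under some assumptions, is to group with the untyped Dataset.groupBy instead. It returns a RelationalGroupedDataset, whose agg(expr: Column, exprs: Column*) accepts any number of columns. The Cdr case class, its field types (dates as ISO strings), the sample rows, and the phoneCounts helper are hypothetical and exist only to make the example self-contained; the column names follow the question.

```scala
import org.apache.spark.sql.{Column, DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, when}

object AggSketch {
  // Hypothetical record type: field names come from the question, types are assumed.
  case class Cdr(bs_recordid: Long, bs_utcdate: String, bc_sub_org_id: Int,
                 bc_utcdate: String, bc_phonenum: String,
                 day_1: String, day_30: String)

  // One conditional count per (window start column, output name) pair.
  private def countInWindow(startCol: String, name: String): Column =
    count(
      when(col("bc_sub_org_id") === lit(500) &&
           col("bc_utcdate") >= col(startCol) &&
           col("bc_utcdate") <= col("bs_utcdate"),
           col("bc_phonenum"))
    ).as(name)

  def phoneCounts(cdrs: Dataset[Cdr]): DataFrame = {
    // Build the aggregations as a list, so adding a window is one more entry.
    val aggs = Seq(
      "day_1"  -> "count_phone_1day",
      "day_30" -> "count_phone_30day"
    ).map { case (start, name) => countInWindow(start, name) }

    // groupBy (not groupByKey) gives a RelationalGroupedDataset with a varargs agg.
    cdrs.groupBy(col("bs_recordid"), col("bs_utcdate"))
      .agg(aggs.head, aggs.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("agg-sketch").getOrCreate()
    import spark.implicits._

    // Made-up sample rows, for illustration only.
    val cdrs = Seq(
      Cdr(1L, "2020-01-30", 500, "2020-01-30", "111", "2020-01-29", "2020-01-01"),
      Cdr(1L, "2020-01-30", 500, "2020-01-10", "222", "2020-01-29", "2020-01-01"),
      Cdr(1L, "2020-01-30", 400, "2020-01-30", "333", "2020-01-29", "2020-01-01")
    ).toDS()

    phoneCounts(cdrs).show()
    spark.stop()
  }
}
```

The result is a DataFrame rather than a typed Dataset. If typed rows are needed afterwards, it can be converted back with .as[SomeCaseClass], provided a case class with matching column names is defined. Sums and distinct counts can be added to the same list with functions such as sum and countDistinct.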
