Spark expression to rename a list of columns after aggregation

Time: 2018-10-26 06:00:21

Tags: scala apache-spark apache-spark-sql

I have written the following code to group and aggregate columns:

 val gmList = List("gc1","gc2","gc3")
 val aList = List("val1","val2","val3","val4","val5")

 val cype = "first"

 val exprs = aList.map((_ -> cype )).toMap

 df.groupBy(gmList.map(col): _*).agg(exprs).show

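But this produces columns whose names all include the aggregate function name, as shown below.

So I want to alias the names, e.g. first(val1) -> val1, and I want to make this generic by building it into exprs.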

 +-----+-----+-----+-------------+-------------+-------------+-------------+-------------+
 | gc1 | gc2 | gc3 | first(val1) | first(val2) | first(val3) | first(val4) | first(val5) |
 +-----+-----+-----+-------------+-------------+-------------+-------------+-------------+

3 answers:

Answer 0: (score: 1)

You can slightly change how you build the expressions and apply the alias function to each of them:

import org.apache.spark.sql.functions.{col, first}
val aList = List("val1","val2","val3","val4","val5")
val exprs = aList.map(c => first(col(c)).alias(c))
df.groupBy(gmList.map(col): _*).agg(exprs.head, exprs.tail: _*).show

Answer 1: (score: 1)

One approach is to alias the aggregated columns back to their original column names in a subsequent select. I would also suggest generalizing the single aggregate function (i.e. first) into a list of functions.
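A minimal sketch of the approach this answer describes, not the answerer's own code: aggregate with a list of function names, then alias the results in a follow-up select. The object name AliasAfterAgg, the helper aggregateAndRename, the sample data and the funcs list are all hypothetical. The sketch also assumes Spark names each result column "<function>(<column>)", as in the question's output.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AliasAfterAgg {
  // Hypothetical helper: aggregates every (column, function) pair, then renames
  // the results in a subsequent select.
  def aggregateAndRename(df: DataFrame,
                         groupCols: Seq[String],
                         pairs: Seq[(String, String)]): DataFrame = {
    val aggregated = df.groupBy(groupCols.map(col): _*).agg(pairs.head, pairs.tail: _*)
    // Assumption: result columns are named "<function>(<column>)", e.g. "first(val1)".
    val aliased = pairs.map { case (c, f) => col(s"$f($c)").as(s"${c}_$f") }
    aggregated.select(groupCols.map(col) ++ aliased: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("alias-after-agg")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data shaped like the question's DataFrame.
    val df = Seq(
      ("a", "b", "c", 1, 2, 3, 4, 5),
      ("a", "b", "c", 6, 7, 8, 9, 10)
    ).toDF("gc1", "gc2", "gc3", "val1", "val2", "val3", "val4", "val5")

    val gmList = List("gc1", "gc2", "gc3")
    val aList = List("val1", "val2", "val3", "val4", "val5")
    val funcs = List("first", "max") // hypothetical list of aggregate function names

    // One (column, function) pair for every combination.
    val pairs = for (c <- aList; f <- funcs) yield c -> f

    aggregateAndRename(df, gmList, pairs).show()
    spark.stop()
  }
}
```

With more than one function per column the aliases have to stay distinct, which is why the sketch uses names like val1_first. With a single function, aliasing to the plain column name (as(c)) gives the result the question asks for.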

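Answer 2: (score: 0)

Here is a more generic version that works with any aggregate function and does not require naming the aggregated columns up front. Build the grouped df as usual, then use: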

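import org.apache.spark.sql.functions.col
// df is the aggregated DataFrame; names without parentheses (e.g. gc1) are left unchanged
val colRegex = raw"^.+\((.*?)\)".r
val newCols = df.columns.map(c => col(c).as(colRegex.replaceAllIn(c, m => m.group(1))))
df.select(newCols: _*)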

This extracts only what is inside the parentheses, regardless of which aggregate function was called (e.g. first(val) -> val, sum(val) -> val, count(val) -> val, etc.).
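To see what the regex does without starting Spark, the renaming step can be tried on plain strings. This is a small sketch; the object name StripAggregateNames and the helper strip are hypothetical, and Regex.quoteReplacement is an added safeguard so that a "$" or "\" in a captured name is not interpreted as replacement syntax.

```scala
import scala.util.matching.Regex

object StripAggregateNames {
  // Same pattern as in the answer: some prefix, "(", a lazily captured name, ")".
  val colRegex: Regex = raw"^.+\((.*?)\)".r

  // Returns the text inside the parentheses, or the name unchanged if nothing matches.
  def strip(name: String): String =
    colRegex.replaceAllIn(name, m => Regex.quoteReplacement(m.group(1)))

  def main(args: Array[String]): Unit = {
    val names = Seq("gc1", "first(val1)", "sum(val2)", "count(val3)")
    names.foreach(n => println(s"$n -> ${strip(n)}"))
  }
}
```

Note that two different aggregates over the same column (for example first(val1) and sum(val1)) both map to val1, so the select would then produce duplicate column names.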