Converting ArrayType(StringType) to IntegerType in a Spark dataframe

Asked: 2018-11-14 08:01:10

Tags: apache-spark apache-spark-sql

I am trying to groupBy the column host and take the average of a column of type ArrayType(StringType) after casting it to ArrayType(IntegerType).

It throws the error below:

    cannot resolve 'avg(variables)' due to datatype mismatch: function average requires numeric types, not ArrayType(IntegerType,true);

Input data - sample dataframe before grouping:

|request|time         |type   |host |service       |variables          |
|-------|-------------|-------|-----|--------------|-------------------|
|REST   |1542111483170|RESTFUL|KAFKA|www.google.com|[Duration, 7, Type]|
|REST   |1542111486570|RESTFUL|KAFKA|www.google.com|[Duration, 9, Type]|

How can I convert or process the ArrayType(StringType) into IntegerType? The column variables is an array of structs (variables.variable: String, variables.value: String, variables.Type: String), and I want to cast the second field, variables.value, to Integer for the aggregation (average calculation).
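The root of the error can be reproduced with a minimal sketch (assuming a local SparkSession named `spark`; the data below mirrors the sample rows): projecting a struct field across the array still yields an array type, which `avg` cannot consume.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.ArrayType

case class Variables(variable: String, value: String, Type: String)
case class ServiceActivity(request: String, time: Long, Type: String, host: String,
                           service: String, variables: Array[Variables])

val spark = SparkSession.builder.master("local[1]").appName("schema-demo").getOrCreate()
import spark.implicits._

val df = Seq(
  ServiceActivity("REST", 1542111483170L, "RESTFUL", "KAFKA", "www.google.com",
    Array(Variables("Duration", "7", "Type"))),
  ServiceActivity("REST", 1542111486570L, "RESTFUL", "KAFKA", "www.google.com",
    Array(Variables("Duration", "9", "Type")))
).toDF()

// `variables.value` projects the `value` field of every struct in the array,
// so the result is ArrayType(StringType) -- casting it to ArrayType(IntegerType)
// still leaves an array, which is why avg() rejects it
val projected = df.select($"variables.value")
projected.printSchema()
```

The schema printed for `value` is an array, not a scalar, which is exactly what the error message complains about.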

Case classes:

 case class ServiceActivity(val request: String, val time: Long, val Type: String, val host: String, val service: String, val variables: Array[Variables])

 case class Variables(val variable: String, val value: String, val Type: String)

Code below:

val report = df.select("*").where(array_contains(df("variables.variable"), "Duration"))
val intermediate = report.withColumn("variables",
  col("variables.value").cast(org.apache.spark.sql.types.ArrayType(org.apache.spark.sql.types.IntegerType, true)))
// value is the second element of the array, i.e. index 1
intermediate.withColumn("duration", $"variables".getItem(1)).drop("variables").withColumnRenamed("duration", "variables")

Group by (error):

 intermediate.groupBy(intermediate("host")).agg(Map("variables"->"avg"))
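One workaround is to avoid the array cast entirely: explode the struct array into rows, keep only the Duration entries, cast the string value to Int, and then aggregate. This is a sketch assuming a local SparkSession (the `Variables`/`ServiceActivity` classes and data mirror the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Variables(variable: String, value: String, Type: String)
case class ServiceActivity(request: String, time: Long, Type: String, host: String,
                           service: String, variables: Array[Variables])

val spark = SparkSession.builder.master("local[1]").appName("avg-demo").getOrCreate()
import spark.implicits._

val df = Seq(
  ServiceActivity("REST", 1542111483170L, "RESTFUL", "KAFKA", "www.google.com",
    Array(Variables("Duration", "7", "Type"))),
  ServiceActivity("REST", 1542111486570L, "RESTFUL", "KAFKA", "www.google.com",
    Array(Variables("Duration", "9", "Type")))
).toDF()

// Explode the struct array so each Variables entry becomes its own row,
// keep only the Duration entries, and cast the string value to Int
val durations = df
  .select(col("host"), explode(col("variables")).as("v"))
  .where(col("v.variable") === "Duration")
  .select(col("host"), col("v.value").cast("int").as("duration"))

// avg() now receives a plain IntegerType column, so the groupBy succeeds
val result = durations.groupBy("host").agg(avg("duration").as("avg_duration"))
result.show()
```

Because `duration` is a scalar numeric column after the explode, `avg` no longer sees an ArrayType and the datatype-mismatch error goes away.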

Any workarounds?

Thanks

1 answer:

Answer 0 (score: 0)

Sorted this out by splitting the array and using the `concat_ws` method.

Thanks