Finding the number of distinct values in array columns

Time: 2018-07-06 20:23:19

Tags: scala apache-spark

I am creating a DataFrame using Scala and Spark. Here is my code so far:

val df = transformedFlattenDF
  .groupBy($"market", $"city", $"carrier")
  .agg(
    count("*").alias("count"),
    min($"bandwidth").alias("bandwidth"),
    first($"network").alias("network"),
    concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"))
  .withColumn("carrierCode", split($"carrierCode", ",").cast("array<string>"))
  .withColumn("Carrier Count", collect_set("carrierCode"))

The carrierCode column becomes an array column. The data looks like this:

CarrierCode
1: [12,2,12]
2: [5,2,8]
3: [1,1,3]

I want to create a column that counts the number of distinct values in each array. I tried collect_set, but it gives me an error saying grouping expressions sequence is empty. Is it possible to find the number of distinct values in each row's array? That way, in our example, there would be a column like this:

Carrier Count
1: 2
2: 3
3: 2

3 Answers:

Answer 0 (score: 1)

collect_set is an aggregate function, so it must be applied within your groupBy-agg step; calling it in withColumn outside an aggregation is what triggers the grouping expressions sequence is empty error:

val df = transformedFlattenDF.groupBy($"market", $"city", $"carrier").agg(
    count("*").alias("count"), min($"bandwidth").alias("bandwidth"),
    first($"network").alias("network"),
    concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"),
    size(collect_set($"carrierCode")).as("carrier_count")  // <-- ADDED `collect_set`
  ).
  withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>"))

If you would rather not change your existing groupBy-agg code, you can create a UDF like in the following example:

import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF; pre-imported in spark-shell

val codeDF = Seq(
  Array("12", "2", "12"),
  Array("5", "2", "8"),
  Array("1", "1", "3")
).toDF("carrier_code")

// UDF that counts the distinct elements within each row's array
def distinctElemCount = udf( (a: Seq[String]) => a.toSet.size )

codeDF.withColumn("carrier_count", distinctElemCount($"carrier_code")).
  show
// +------------+-------------+
// |carrier_code|carrier_count|
// +------------+-------------+
// | [12, 2, 12]|            2|
// |   [5, 2, 8]|            3|
// |   [1, 1, 3]|            2|
// +------------+-------------+
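Applied to the question's df (assuming its carrierCode column has already been cast to array&lt;string&gt; by the split step), the UDF slots in the same way; the result name below is just the column the question asked for:

// Hypothetical usage on the asker's df from the question
val withCarrierCount = df.withColumn("Carrier Count", distinctElemCount($"carrierCode"))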

Answer 1 (score: 0)

Without a UDF, using RDD transformations and converting back to a DF, for posterity:

import spark.implicits._  // needed for toDF; pre-imported in spark-shell

val df = sc.parallelize(Seq(
         ("A", 2, 100, 2), ("F", 7, 100, 1), ("B", 10, 100, 100)
         )).toDF("c1", "c2", "c3", "c4")

// Pair each key with a list of the values from the remaining columns
val x = df.select("c1", "c2", "c3", "c4").rdd.map(x => (x.get(0), List(x.get(1), x.get(2), x.get(3))))
// Count the distinct values in each list
val y = x.map { case (k, vL) => (k, vL.toSet.size) }
y.collect
// Manipulate back to your DF, via conversion, join, what not.

Returns:

res15: Array[(Any, Int)] = Array((A,2), (F,3), (B,2))
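For the conversion back to a DataFrame hinted at in the final comment, a minimal sketch: the key is converted to String so toDF has an encoder, and the names countsDF and distinct_count are assumptions, not from the original answer.

// Hypothetical round trip: name the (key, count) pairs and join back on "c1"
val countsDF = y.map { case (k, n) => (k.toString, n) }.toDF("c1", "distinct_count")
val result = df.join(countsDF, Seq("c1"))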

The solution above is the better one, as noted, for posterity.

Answer 2 (score: 0)

You can take the help of a UDF and do it like this.

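A minimal sketch of such a UDF, mirroring the distinct-count semantics of the accepted approach above; the name countDistinctValues and its application to the question's carrierCode column are assumptions, not from the original answer:

import org.apache.spark.sql.functions.udf

// Hypothetical sketch: count the distinct entries in each row's array
val countDistinctValues = udf((xs: Seq[String]) => xs.distinct.size)
val withCounts = df.withColumn("Carrier Count", countDistinctValues($"carrierCode"))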