I am creating a DataFrame using Scala and Spark. This is my code so far:
val df = transformedFlattenDF
  .groupBy($"market", $"city", $"carrier")
  .agg(count("*").alias("count"), min($"bandwidth").alias("bandwidth"),
       first($"network").alias("network"),
       concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"))
  .withColumn("carrierCode", split($"carrierCode", ",").cast("array<string>"))
  .withColumn("Carrier Count", collect_set("carrierCode")) // this is the attempt that throws "grouping expressions sequence is empty"
The carrierCode column becomes an array column, and the data looks like this:
CarrierCode
1: [12,2,12]
2: [5,2,8]
3: [1,1,3]
I want to create a column that counts the number of distinct values in each array. I tried using collect_set, but it gives me an error saying grouping expressions sequence is empty. Is it possible to find the number of distinct values in each row's array? That way, for our example, there could be a column like this:
Carrier Count
1: 2
2: 3
3: 2
Answer 0 (Score: 1)
collect_set is an aggregate function, so it should be applied within your groupBy-agg step:
val df = transformedFlattenDF.groupBy($"market", $"city", $"carrier").agg(
count("*").alias("count"), min($"bandwidth").alias("bandwidth"),
first($"network").alias("network"),
concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"),
size(collect_set($"carrierCode")).as("carrier_count") // <-- ADDED `collect_set`
).
withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>"))
If you don't want to change your existing groupBy-agg code, you can create a UDF instead, as in the following example:
import org.apache.spark.sql.functions._
import spark.implicits._  // for .toDF; already in scope in the spark-shell

val codeDF = Seq(
  Array("12", "2", "12"),
  Array("5", "2", "8"),
  Array("1", "1", "3")
).toDF("carrier_code")

// UDF that counts the distinct elements of each array
def distinctElemCount = udf( (a: Seq[String]) => a.toSet.size )

codeDF.withColumn("carrier_count", distinctElemCount($"carrier_code")).show
// +------------+-------------+
// |carrier_code|carrier_count|
// +------------+-------------+
// | [12, 2, 12]|            2|
// |   [5, 2, 8]|            3|
// |   [1, 1, 3]|            2|
// +------------+-------------+
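As a side note, on Spark 2.4 or later the same count can be obtained with built-in functions instead of a UDF; a minimal sketch, reusing the codeDF sample from above:

import org.apache.spark.sql.functions.{array_distinct, size}

// array_distinct drops duplicates within each array, size counts what is left (Spark 2.4+)
codeDF.withColumn("carrier_count", size(array_distinct($"carrier_code"))).show
// should show 2, 3 and 2 for the three sample rows above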
Answer 1 (Score: 0)
Without a UDF, using an RDD transformation and then going back to a DF, for posterity:
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  ("A", 2, 100, 2), ("F", 7, 100, 1), ("B", 10, 100, 100)
)).toDF("c1", "c2", "c3", "c4")

// Key each row by c1 and collect the remaining columns into a list
val x = df.select("c1", "c2", "c3", "c4").rdd.map(r => (r.get(0), List(r.get(1), r.get(2), r.get(3))))
// Count the distinct values per row
val y = x.map { case (k, vL) => (k, vL.toSet.size) }
// Manipulate back to your DF, via conversion, join, what not.
y.collect
Returns:
res15: Array[(Any, Int)] = Array((A,2), (F,3), (B,2))
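One possible way to follow that last comment and get back to a DataFrame; a sketch, assuming c1 is a String and c2..c4 are Ints as in the sample data, with a hypothetical name for the result column:

// Keep the key typed as String so an implicit encoder exists for toDF
val z = df.rdd
  .map(r => (r.getString(0), Seq(r.getInt(1), r.getInt(2), r.getInt(3)).toSet.size))
  .toDF("c1", "distinct_count")
z.show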
The answer above is the better solution; this one is kept for posterity.
Answer 2 (Score: 0)
You can take the help of a UDF to do this.
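A minimal sketch of that idea, assuming the carrierCode array column from the question (the UDF name is hypothetical, and this is essentially the same approach as the UDF in the first answer):

import org.apache.spark.sql.functions.udf

// Hypothetical UDF counting the distinct entries of an array column
val distinctCountUdf = udf((codes: Seq[String]) => codes.distinct.size)

df.withColumn("Carrier Count", distinctCountUdf($"carrierCode")).show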