I am performing some aggregation on a DataFrame I created. Here are the steps:
val initDF = spark.read.format("csv").schema(someSchema).option("header","true").load(filePath).as[someCaseClass]
var maleFemaleDistribution = initDF.select("DISTRICT","GENDER","ENROLMENT_ACCEPTED","ENROLMENT_REJECTED").groupBy("DISTRICT").agg(
count( lit(1).alias("OVERALL_COUNT")),
sum(when(col("GENDER") === "M", 1).otherwise(0).alias("MALE_COUNT")),
sum(when(col("GENDER") === "F", 1).otherwise(0).alias("FEMALE_COUNT"))
).orderBy("DISTRICT")
When I run printSchema on the newly created DataFrame, the column names are not the aliases I provided; instead it shows:
maleFemaleDistribution.printSchema
root
|-- DISTRICT: string (nullable = true)
|-- count(1 AS `OVERALL_COUNT`): long (nullable = false)
|-- sum(CASE WHEN (GENDER = M) THEN 1 ELSE 0 END AS `MALE_COUNT`): long (nullable = true)
|-- sum(CASE WHEN (GENDER = F) THEN 1 ELSE 0 END AS `FEMALE_COUNT`): long (nullable = true)
What I expect the column names to be:
maleFemaleDistribution.printSchema
root
|-- DISTRICT: string (nullable = true)
|-- OVERALL_COUNT: long (nullable = false)
|-- MALE_COUNT: long (nullable = true)
|-- FEMALE_COUNT: long (nullable = true)
I am looking for help understanding why the aliases are not applied in the new DataFrame, and how I should modify the code so the columns reflect the alias names given above.

Answer 0 (score: 1)
I haven't tried running the query, but it should be:
var maleFemaleDistribution = initDF.select("DISTRICT","GENDER","ENROLMENT_ACCEPTED","ENROLMENT_REJECTED").groupBy("DISTRICT").agg(
count(lit(1)).alias("OVERALL_COUNT"),
sum(when(col("GENDER") === "M", 1).otherwise(0)).alias("MALE_COUNT"),
sum(when(col("GENDER") === "F", 1).otherwise(0)).alias("FEMALE_COUNT")
).orderBy("DISTRICT")
Answer 1 (score: 0)
You should call the alias function after the sum operation. So, instead of this:
sum(when(col("GENDER") === "M", 1).otherwise(0).alias("MALE_COUNT"))
it should be like this:
sum(when(col("GENDER") === "M", 1).otherwise(0)).alias("MALE_COUNT")
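The underlying rule in both answers is that `alias()` names the `Column` it is called on. A minimal sketch of the two placements (untested; column names follow Spark's default expression-naming behavior, and only the `Column` expressions are built, so no SparkSession is needed):

```scala
import org.apache.spark.sql.functions.{col, sum, when}

// alias() inside sum(): it names the intermediate CASE expression, which
// sum() then wraps, yielding the long auto-generated column name like
// sum(CASE WHEN (GENDER = M) THEN 1 ELSE 0 END AS `MALE_COUNT`)
// seen in the question's printSchema output.
val innerAlias = sum(when(col("GENDER") === "M", 1).otherwise(0).alias("MALE_COUNT"))

// alias() on the result of sum(): it names the aggregate output column
// itself, so the schema shows MALE_COUNT.
val outerAlias = sum(when(col("GENDER") === "M", 1).otherwise(0)).alias("MALE_COUNT")
```

The same applies to the `count` column: `count(lit(1)).alias("OVERALL_COUNT")` names the count, while `count(lit(1).alias("OVERALL_COUNT"))` only names the literal inside it.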