DataFrame column names are not updated with the alias

Time: 2018-05-10 01:47:55

Tags: apache-spark apache-spark-sql spark-dataframe

I am performing an aggregation on a DataFrame I created. These are the steps:

val initDF = spark.read.format("csv").schema(someSchema).option("header","true").load(filePath).as[someCaseClass]

var maleFemaleDistribution = initDF.select("DISTRICT","GENDER","ENROLMENT_ACCEPTED","ENROLMENT_REJECTED").groupBy("DISTRICT").agg(
     count( lit(1).alias("OVERALL_COUNT")),
     sum(when(col("GENDER") === "M", 1).otherwise(0).alias("MALE_COUNT")),
     sum(when(col("GENDER") === "F", 1).otherwise(0).alias("FEMALE_COUNT"))
      ).orderBy("DISTRICT")

When I call printSchema on the newly created DataFrame, the column names are not the aliases I provided. Instead it shows:

maleFemaleDistribution.printSchema
root
 |-- DISTRICT: string (nullable = true)
 |-- count(1 AS `OVERALL_COUNT`): long (nullable = false)
 |-- sum(CASE WHEN (GENDER = M) THEN 1 ELSE 0 END AS `MALE_COUNT`): long (nullable = true)
 |-- sum(CASE WHEN (GENDER = F) THEN 1 ELSE 0 END AS `FEMALE_COUNT`): long (nullable = true)

I expected the column names to be:

maleFemaleDistribution.printSchema
root
 |-- DISTRICT: string (nullable = true)
 |-- OVERALL_COUNT: long (nullable = false)
 |-- MALE_COUNT: long (nullable = true)
 |-- FEMALE_COUNT: long (nullable = true) 

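I am looking for help understanding why the aliases are not applied in the new DataFrame, and how I should modify the code so that the column names match the aliases I specified.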

2 Answers:

Answer 0: (score: 1)

I have not tried running the query, but it should be:

var maleFemaleDistribution = initDF.select("DISTRICT","GENDER","ENROLMENT_ACCEPTED","ENROLMENT_REJECTED").groupBy("DISTRICT").agg(
     count(lit(1)).alias("OVERALL_COUNT"),
     sum(when(col("GENDER") === "M", 1).otherwise(0)).alias("MALE_COUNT"),
     sum(when(col("GENDER") === "F", 1).otherwise(0)).alias("FEMALE_COUNT")
      ).orderBy("DISTRICT")
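Since the query above was posted untested, here is a minimal sketch for checking it locally. The local SparkSession, the object and function names, and the three sample rows are assumptions made for illustration; the original reads a CSV with a schema that is not shown. With the alias applied to the result of each aggregate, printSchema should list DISTRICT, OVERALL_COUNT, MALE_COUNT and FEMALE_COUNT.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, sum, when}

object AliasAfterAggregate {
  // Same aggregation as in the answer, with each alias applied
  // to the aggregate result rather than to its argument.
  def distribution(df: DataFrame): DataFrame =
    df.groupBy("DISTRICT").agg(
      count(lit(1)).alias("OVERALL_COUNT"),
      sum(when(col("GENDER") === "M", 1).otherwise(0)).alias("MALE_COUNT"),
      sum(when(col("GENDER") === "F", 1).otherwise(0)).alias("FEMALE_COUNT")
    ).orderBy("DISTRICT")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("alias-after-aggregate")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for initDF.
    val initDF = Seq(
      ("North", "M"),
      ("North", "F"),
      ("South", "F")
    ).toDF("DISTRICT", "GENDER")

    distribution(initDF).printSchema()
    spark.stop()
  }
}
```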

Answer 1: (score: 0)

You should apply the alias after the sum operation. So, instead of this,

sum(when(col("GENDER") === "M", 1).otherwise(0).alias("MALE_COUNT"))

it should be like this:

sum(when(col("GENDER") === "M", 1).otherwise(0)).alias("MALE_COUNT")
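The reason is that alias names the expression it is called on. Inside the parentheses it only names the argument passed to sum, so the aggregate result stays unnamed and Spark generates a column name from the expression text, as in the schema shown in the question. The sketch below contrasts the two placements; the object name, helper names and sample rows are hypothetical, and the exact auto-generated name may differ between Spark versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum, when}

object AliasPlacement {
  // 1 for male rows, 0 otherwise.
  private val isMale = when(col("GENDER") === "M", 1).otherwise(0)

  // Alias on the argument of sum: the output column keeps a generated name.
  def innerAlias(df: DataFrame): DataFrame =
    df.groupBy("DISTRICT").agg(sum(isMale.alias("MALE_COUNT")))

  // Alias on the result of sum: the output column is named MALE_COUNT.
  def outerAlias(df: DataFrame): DataFrame =
    df.groupBy("DISTRICT").agg(sum(isMale).alias("MALE_COUNT"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("alias-placement")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows.
    val df = Seq(("North", "M"), ("North", "F")).toDF("DISTRICT", "GENDER")

    println(innerAlias(df).columns.mkString(", "))
    println(outerAlias(df).columns.mkString(", ")) // DISTRICT, MALE_COUNT
    spark.stop()
  }
}
```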