I need to write code that returns the most populous city in each country. Here is the input data:
DataFrame = {
/** Input data */
val inputDf = Seq(
("Warsaw", "Poland", "1 764 615"),
("Cracow", "Poland", "769 498"),
("Paris", "France", "2 206 488"),
("Villeneuve-Loubet", "France", "15 020"),
("Pittsburgh PA", "United States", "302 407"),
("Chicago IL", "United States", "2 716 000"),
("Milwaukee WI", "United States", "595 351"),
("Vilnius", "Lithuania", "580 020"),
("Stockholm", "Sweden", "972 647"),
("Goteborg", "Sweden", "580 020")
).toDF("name", "country", "population")
println("Input:")
inputDf.show(false)
My attempted solution is:
val topPopulation = inputDf
// .select("name", "country", "population")
.withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
// .agg(max($"population").alias(("population")))
// .withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
// .withColumn("country", $"country")
// .withColumn("name", $"name")
// .cast("Integer")
.groupBy("country")
.agg(
max("population").alias("population")
)
.orderBy($"population".desc)
// .orderBy("max(population)")
topPopulation
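But I am stuck, because I get an error saying the operation "can only be performed on tables with the same number of columns, but the first table has 2 columns and the second table has 3 columns;"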
Input:
+-----------------+-------------+----------+
|name             |country      |population|
+-----------------+-------------+----------+
|Warsaw           |Poland       |1 764 615 |
|Cracow           |Poland       |769 498   |
|Paris            |France       |2 206 488 |
|Villeneuve-Loubet|France       |15 020    |
|Pittsburgh PA    |United States|302 407   |
|Chicago IL       |United States|2 716 000 |
|Milwaukee WI     |United States|595 351   |
|Vilnius          |Lithuania    |580 020   |
|Stockholm        |Sweden       |972 647   |
|Goteborg         |Sweden       |580 020   |
+-----------------+-------------+----------+
Expected:
+----------+-------------+----------+
|name      |country      |population|
+----------+-------------+----------+
|Warsaw    |Poland       |1 764 615 |
|Paris     |France       |2 206 488 |
|Chicago IL|United States|2 716 000 |
|Vilnius   |Lithuania    |580 020   |
|Stockholm |Sweden       |972 647   |
+----------+-------------+----------+
Actual:
+-------------+----------+
|country      |population|
+-------------+----------+
|United States|2716000   |
|France       |2206488   |
|Poland       |1764615   |
|Sweden       |972647    |
|Lithuania    |580020    |
+-------------+----------+
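A likely reason for the two-column result can be sketched with plain Scala collections instead of Spark: grouping by country and aggregating only the maximum population keeps just the grouping key and the aggregate, so the city name is no longer part of the result. The `City` case class and the `LostColumnSketch` object are hypothetical names used only for this illustration, and the data is a subset of the input above.

```scala
object LostColumnSketch {
  // Hypothetical row type standing in for one DataFrame row
  final case class City(name: String, country: String, population: Int)

  val cities: Seq[City] = Seq(
    City("Warsaw", "Poland", 1764615),
    City("Cracow", "Poland", 769498),
    City("Paris", "France", 2206488),
    City("Villeneuve-Loubet", "France", 15020)
  )

  // Analogue of groupBy("country").agg(max("population")):
  // each result entry holds only the country and the maximum, not the name.
  val maxPerCountry: Map[String, Int] =
    cities.groupBy(_.country).map { case (country, group) =>
      country -> group.map(_.population).max
    }

  def main(args: Array[String]): Unit =
    maxPerCountry.foreach { case (country, population) =>
      println(s"$country -> $population")
    }
}
```

Each entry carries two values while the expected rows carry three, which is consistent with the column-count mismatch reported in the error.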
Answer 0 (score: 1)
Try this:
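// Assumes a SparkSession named `spark` is in scope (as in spark-shell)
import org.apache.spark.sql.functions.{max, regexp_replace, struct}
import spark.implicits._

val inputDf = Seq(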
("Warsaw", "Poland", "1 764 615"),
("Cracow", "Poland", "769 498"),
("Paris", "France", "2 206 488"),
("Villeneuve-Loubet", "France", "15 020"),
("Pittsburgh PA", "United States", "302 407"),
("Chicago IL", "United States", "2 716 000"),
("Milwaukee WI", "United States", "595 351"),
("Vilnius", "Lithuania", "580 020"),
("Stockholm", "Sweden", "972 647"),
("Goteborg", "Sweden", "580 020")
).toDF("name", "country", "population")
println("Input:")
inputDf.show(false)
/**
* Input:
* +-----------------+-------------+----------+
* |name             |country      |population|
* +-----------------+-------------+----------+
* |Warsaw           |Poland       |1 764 615 |
* |Cracow           |Poland       |769 498   |
* |Paris            |France       |2 206 488 |
* |Villeneuve-Loubet|France       |15 020    |
* |Pittsburgh PA    |United States|302 407   |
* |Chicago IL       |United States|2 716 000 |
* |Milwaukee WI     |United States|595 351   |
* |Vilnius          |Lithuania    |580 020   |
* |Stockholm        |Sweden       |972 647   |
* |Goteborg         |Sweden       |580 020   |
* +-----------------+-------------+----------+
*/
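val topPopulation = inputDf
  // Strip the space thousands separators so the population can be cast to an integer
  .withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
  // Put population first: structs compare field by field, so max picks the largest population
  .withColumn("population_name", struct($"population", $"name"))
  .groupBy("country")
  .agg(max("population_name").as("population_name"))
  // Expand the winning struct back into separate population and name columns
  .selectExpr("country", "population_name.*")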
topPopulation.show(false)
topPopulation.printSchema()
/**
* +-------------+----------+----------+
* |country      |population|name      |
* +-------------+----------+----------+
* |France       |2206488   |Paris     |
* |Poland       |1764615   |Warsaw    |
* |Lithuania    |580020    |Vilnius   |
* |Sweden       |972647    |Stockholm |
* |United States|2716000   |Chicago IL|
* +-------------+----------+----------+
*
* root
* |-- country: string (nullable = true)
* |-- population: integer (nullable = true)
* |-- name: string (nullable = true)
*/
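Why taking `max` of a struct works can be illustrated without Spark. Spark orders structs field by field, left to right, much as Scala orders tuples, so placing `population` first makes the largest population win and carries `name` along with it. The following is a minimal plain-Scala sketch under that analogy; `City`, `StructMaxSketch`, `parsePopulation`, and `topCityPerCountry` are hypothetical names and not part of the answer's code.

```scala
object StructMaxSketch {
  // Hypothetical row type standing in for one DataFrame row
  final case class City(name: String, country: String, population: Int)

  // Mirrors regexp_replace(population, " ", "").cast("Integer")
  def parsePopulation(raw: String): Int = raw.replace(" ", "").toInt

  val cities: Seq[City] = Seq(
    City("Warsaw", "Poland", parsePopulation("1 764 615")),
    City("Cracow", "Poland", parsePopulation("769 498")),
    City("Pittsburgh PA", "United States", parsePopulation("302 407")),
    City("Chicago IL", "United States", parsePopulation("2 716 000")),
    City("Milwaukee WI", "United States", parsePopulation("595 351"))
  )

  // Tuples are ordered lexicographically: population is compared first,
  // name only when populations are equal. This mirrors max over
  // struct(population, name) within each country group.
  def topCityPerCountry(rows: Seq[City]): Map[String, (Int, String)] =
    rows.groupBy(_.country).map { case (country, group) =>
      country -> group.map(c => (c.population, c.name)).max
    }

  def main(args: Array[String]): Unit =
    topCityPerCountry(cities).foreach { case (country, (population, name)) =>
      println(s"$country: $name ($population)")
    }
}
```

If two cities in the same country had equal populations, the second field would break the tie, so the alphabetically later name would be returned. To match the column order of the expected output, a final `.select("name", "country", "population")` could be appended to the answer's pipeline.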