Find the most populous city in each country

Time: 2020-06-15 14:09:18

Tags: scala apache-spark

I need to write code that returns the most populous city in each country. This is the input data:

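// Assumes a SparkSession named `spark` is in scope
import org.apache.spark.sql.functions.{max, regexp_replace}
import spark.implicits._

/** Input data */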
val inputDf = Seq(
  ("Warsaw", "Poland", "1 764 615"),
  ("Cracow", "Poland", "769 498"),
  ("Paris", "France", "2 206 488"),
  ("Villeneuve-Loubet", "France", "15 020"),
  ("Pittsburgh PA", "United States", "302 407"),
  ("Chicago IL", "United States", "2 716 000"),
  ("Milwaukee WI", "United States", "595 351"),
  ("Vilnius", "Lithuania", "580 020"),
  ("Stockholm", "Sweden", "972 647"),
  ("Goteborg", "Sweden", "580 020")
).toDF("name", "country", "population")
println("Input:")
inputDf.show(false)

My attempt is:

 val topPopulation = inputDf
  //        .select("name", "country", "population")
  .withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))

  //      .agg(max($"population").alias(("population")))
  //        .withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
  //        .withColumn("country", $"country")
  //        .withColumn("name", $"name")
  //          .cast("Integer")
  .groupBy("country")
  .agg(
    max("population").alias("population")
  )
  .orderBy($"population".desc)
//      .orderBy("max(population)")
topPopulation

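But I am stuck, because I get an error saying the operation "can only be performed on tables with the same number of columns, but the first table has 2 columns and the second table has 3 columns;"

Input: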

+-----------------+-------------+----------+
|name             |country      |population|
+-----------------+-------------+----------+
|Warsaw           |Poland       |1 764 615 |
|Cracow           |Poland       |769 498   |
|Paris            |France       |2 206 488 |
|Villeneuve-Loubet|France       |15 020    |
|Pittsburgh PA    |United States|302 407   |
|Chicago IL       |United States|2 716 000 |
|Milwaukee WI     |United States|595 351   |
|Vilnius          |Lithuania    |580 020   |
|Stockholm        |Sweden       |972 647   |
|Goteborg         |Sweden       |580 020   |
+-----------------+-------------+----------+

Expected:

+----------+-------------+----------+
|name      |country      |population|
+----------+-------------+----------+
|Warsaw    |Poland       |1 764 615 |
|Paris     |France       |2 206 488 |
|Chicago IL|United States|2 716 000 |
|Vilnius   |Lithuania    |580 020   |
|Stockholm |Sweden       |972 647   |
+----------+-------------+----------+

Actual:

+-------------+----------+
|country      |population|
+-------------+----------+
|United States|2716000   |
|France       |2206488   |
|Poland       |1764615   |
|Sweden       |972647    |
|Lithuania    |580020    |
+-------------+----------+

1 answer:

Answer 0 (score: 1)

Try this:

Load the test data

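    // Assumes a SparkSession named `spark` is in scope
    import org.apache.spark.sql.functions.{max, regexp_replace, struct}
    import spark.implicits._

    val inputDf = Seq(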
      ("Warsaw", "Poland", "1 764 615"),
      ("Cracow", "Poland", "769 498"),
      ("Paris", "France", "2 206 488"),
      ("Villeneuve-Loubet", "France", "15 020"),
      ("Pittsburgh PA", "United States", "302 407"),
      ("Chicago IL", "United States", "2 716 000"),
      ("Milwaukee WI", "United States", "595 351"),
      ("Vilnius", "Lithuania", "580 020"),
      ("Stockholm", "Sweden", "972 647"),
      ("Goteborg", "Sweden", "580 020")
    ).toDF("name", "country", "population")
    println("Input:")
    inputDf.show(false)
    /**
      * Input:
      * +-----------------+-------------+----------+
      * |name             |country      |population|
      * +-----------------+-------------+----------+
      * |Warsaw           |Poland       |1 764 615 |
      * |Cracow           |Poland       |769 498   |
      * |Paris            |France       |2 206 488 |
      * |Villeneuve-Loubet|France       |15 020    |
      * |Pittsburgh PA    |United States|302 407   |
      * |Chicago IL       |United States|2 716 000 |
      * |Milwaukee WI     |United States|595 351   |
      * |Vilnius          |Lithuania    |580 020   |
      * |Stockholm        |Sweden       |972 647   |
      * |Goteborg         |Sweden       |580 020   |
      * +-----------------+-------------+----------+
      */

Find the most populous city in each country
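This step relies on how Spark orders struct values: field by field, left to right. With population as the first field, `max` picks the largest population and carries the matching name along with it. Plain Scala tuples are ordered the same way, so the idea can be sketched without Spark. The `City` case class and the `StructOrderingSketch` object are hypothetical names used only for this illustration.

```scala
// Plain Scala, no Spark required. Names here are illustrative only.
final case class City(name: String, country: String, population: String)

object StructOrderingSketch {
  val cities: Seq[City] = Seq(
    City("Warsaw", "Poland", "1 764 615"),
    City("Cracow", "Poland", "769 498"),
    City("Stockholm", "Sweden", "972 647"),
    City("Goteborg", "Sweden", "580 020")
  )

  /** For each country, the largest (population, name) pair.
    * Tuples compare on the first element, then on the second,
    * so population decides and name only breaks ties.
    */
  def topByCountry(rows: Seq[City]): Map[String, (Int, String)] =
    rows
      .groupBy(_.country)
      .map { case (country, group) =>
        // Remove the space separators before converting; comparing the raw
        // strings would rank "769 498" above "1 764 615".
        country -> group.map(c => (c.population.replace(" ", "").toInt, c.name)).max
      }
}
```

Calling `StructOrderingSketch.topByCountry(StructOrderingSketch.cities)` should map Poland to the Warsaw pair and Sweden to the Stockholm pair. Swapping the tuple to `(name, population)` would instead pick the alphabetically last city, which is why the field order matters.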


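    val topPopulation = inputDf
      // Strip the space separators so populations compare as numbers, not strings
      .withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
      // Structs compare field by field, so population has to come first
      .withColumn("population_name", struct($"population", $"name"))
      .groupBy("country")
      .agg(max("population_name").as("population_name"))
      // Flatten the struct back into separate population and name columns
      .selectExpr("country", "population_name.*")
    topPopulation.show(false)
    topPopulation.printSchema()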

    /**
      * +-------------+----------+----------+
      * |country      |population|name      |
      * +-------------+----------+----------+
      * |France       |2206488   |Paris     |
      * |Poland       |1764615   |Warsaw    |
      * |Lithuania    |580020    |Vilnius   |
      * |Sweden       |972647    |Stockholm |
      * |United States|2716000   |Chicago IL|
      * +-------------+----------+----------+
      *
      * root
      * |-- country: string (nullable = true)
      * |-- population: integer (nullable = true)
      * |-- name: string (nullable = true)
      */
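The result above has the columns in the order country, population, name, with population as an integer, while the expected table in the question lists name, country, population with the original formatted strings. If that shape matters, one possible alternative is a window function. The following is a minimal sketch, not the answerer's method: the `TopCity` object, the `topCityPerCountry` helper, and the temporary `rank` column are hypothetical names, and it assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, regexp_replace, row_number}

object TopCity {
  /** Keeps, for every country, the row with the largest numeric population.
    * The original columns and their order are left untouched.
    */
  def topCityPerCountry(df: DataFrame): DataFrame = {
    // Numeric view of the population, used only for ordering
    val populationAsInt = regexp_replace(col("population"), " ", "").cast("int")
    val byPopulationDesc = Window.partitionBy("country").orderBy(populationAsInt.desc)

    df.withColumn("rank", row_number().over(byPopulationDesc))
      .where(col("rank") === 1)
      .drop("rank")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("top-city-per-country")
      .getOrCreate()
    import spark.implicits._

    val inputDf = Seq(
      ("Warsaw", "Poland", "1 764 615"),
      ("Cracow", "Poland", "769 498"),
      ("Paris", "France", "2 206 488"),
      ("Villeneuve-Loubet", "France", "15 020"),
      ("Pittsburgh PA", "United States", "302 407"),
      ("Chicago IL", "United States", "2 716 000"),
      ("Milwaukee WI", "United States", "595 351"),
      ("Vilnius", "Lithuania", "580 020"),
      ("Stockholm", "Sweden", "972 647"),
      ("Goteborg", "Sweden", "580 020")
    ).toDF("name", "country", "population")

    topCityPerCountry(inputDf).show(false)
    spark.stop()
  }
}
```

`row_number` returns exactly one row per country, so if two cities in the same country shared the top population, one of them would be dropped arbitrarily. Replacing it with `rank` would keep all tied rows instead. The row order of the output is not guaranteed unless an explicit `orderBy` is added.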