Spark program to find the city with the largest population

Time: 2016-11-14 14:33:02

Tags: apache-spark pyspark bigdata

The input file contains lines like the following (state, city, population):

west bengal,kolkata,150000
karnataka,bangalore,200000
karnataka,mangalore,80000
west bengal,bongaon,50000
delhi,new delhi,100000
delhi,gurgaon,200000

I have to write a Spark (Apache Spark) program in Python and Scala to find the city with the largest population in each state. The output should look like this:

west bengal,kolkata,150000
karnataka,bangalore,200000
delhi,gurgaon,200000

So I need a three-column output for each state. I can easily get output like this:

west bengal,150000
karnataka,200000
delhi,200000

But getting the name of the city with the largest population alongside it is proving difficult.

2 answers:

Answer 0 (score: 1)

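In vanilla PySpark, map the data to a pair RDD with the state as the key and the pair (city, population) as the value. Then use reduceByKey to keep the city with the largest population. Note that for cities with equal populations it keeps whichever one the reducer encounters first, and that order depends on how Spark combines the values.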

rdd.map(lambda reg: (reg[0], [reg[1], reg[2]])) \
   .reduceByKey(lambda v1, v2: v1 if v1[1] >= v2[1] else v2)

For your data, the result looks like this:

[('delhi', ['gurgaon', 200000]), ('west bengal', ['kolkata', 150000]), ('karnataka', ['bangalore', 200000])]
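The lambda above compares reg[2] as-is, so if the rows come straight from a text file the population has to be converted to an integer first; otherwise the strings are compared lexicographically. The following is a minimal plain-Python sketch that mimics the map and reduceByKey steps without Spark, so the reducer logic can be checked on its own. The names parse_line, larger_city, and reduce_by_key are hypothetical helpers for illustration, not Spark APIs.

```python
# Sample input lines, copied from the question.
LINES = [
    "west bengal,kolkata,150000",
    "karnataka,bangalore,200000",
    "karnataka,mangalore,80000",
    "west bengal,bongaon,50000",
    "delhi,new delhi,100000",
    "delhi,gurgaon,200000",
]


def parse_line(line):
    """Split 'state,city,population' into (state, (city, population as int))."""
    state, city, population = line.split(",")
    return state, (city, int(population))


def larger_city(v1, v2):
    """Same rule as the lambda above: keep the value with the larger population."""
    return v1 if v1[1] >= v2[1] else v2


def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for reduceByKey: fold the values of each key."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


largest = reduce_by_key(map(parse_line, LINES), larger_city)
for state, (city, population) in largest.items():
    print(f"{state},{city},{population}")
```

In Spark, the same two functions could presumably be passed as rdd.map(parse_line).reduceByKey(larger_city), assuming rdd holds the raw text lines (for example, as loaded with sc.textFile).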

Answer 1 (score: 0)

This should do the trick:

>>> sc = spark.sparkContext
>>> rdd = sc.parallelize([
    ['west bengal','kolkata',150000],
    ['karnataka','bangalore',200000],
    ['karnataka','mangalore',80000],
    ['west bengal','bongaon',50000],
    ['delhi','new delhi',100000],
    ['delhi','gurgaon',200000],
])

>>> df = rdd.toDF(['state','city','population'])
>>> df.show()
+-----------+---------+----------+
|      state|     city|population|
+-----------+---------+----------+
|west bengal|  kolkata|    150000|
|  karnataka|bangalore|    200000|
|  karnataka|mangalore|     80000|
|west bengal|  bongaon|     50000|
|      delhi|new delhi|    100000|
|      delhi|  gurgaon|    200000|
+-----------+---------+----------+


>>> df.groupBy('city').max('population').show()
+---------+---------------+
|     city|max(population)|
+---------+---------------+
|bangalore|         200000|
|  kolkata|         150000|
|  gurgaon|         200000|
|mangalore|          80000|
|new delhi|         100000|
|  bongaon|          50000|
+---------+---------------+

>>> df.groupBy('state').max('population').show()
+-----------+---------------+
|      state|max(population)|
+-----------+---------------+
|      delhi|         200000|
|west bengal|         150000|
|  karnataka|         200000|
+-----------+---------------+
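The last result gives the maximum population per state but no longer carries the city column. One common way to recover it is to match each state's maximum back against the original rows, which is what a join on (state, population) would do in the DataFrame API. The sketch below shows that logic in plain Python on the same rows; ROWS, max_by_state, and largest_cities are illustrative names, and a tie within a state would yield more than one row for that state.

```python
# Same rows as the DataFrame above: (state, city, population).
ROWS = [
    ("west bengal", "kolkata", 150000),
    ("karnataka", "bangalore", 200000),
    ("karnataka", "mangalore", 80000),
    ("west bengal", "bongaon", 50000),
    ("delhi", "new delhi", 100000),
    ("delhi", "gurgaon", 200000),
]

# Step 1: the equivalent of groupBy('state').max('population').
max_by_state = {}
for state, _city, population in ROWS:
    max_by_state[state] = max(population, max_by_state.get(state, population))

# Step 2: keep the original rows whose population equals their state's maximum,
# which plays the role of joining the aggregate back to the full table.
largest_cities = [row for row in ROWS if row[2] == max_by_state[row[0]]]

for state, city, population in largest_cities:
    print(f"{state},{city},{population}")
```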