The input file contains lines like the following (state, city, population):
west bengal,kolkata,150000
karnataka,bangalore,200000
karnataka,mangalore,80000
west bengal,bongaon,50000
delhi,new delhi,100000
delhi,gurgaon,200000
I have to write a Spark (Apache Spark) program, in both Python and Scala, that finds the city with the largest population in each state. The output would be something like this:
west bengal,kolkata,150000
karnataka,bangalore,200000
delhi,new delhi,100000
So for each state I need three columns in the output. I can easily get output like this:
west bengal,150000
karnataka,200000
delhi,100000
But also getting the city that holds that largest population is proving more difficult.
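(Just to ground that two-column result: assuming an RDD, here hypothetically named rdd, that already holds parsed [state, city, population] rows, it is a single map plus reduceByKey.)

# sketch only: 'rdd' is assumed to hold parsed [state, city, population] rows
rdd.map(lambda reg: (reg[0], int(reg[2]))) \
   .reduceByKey(max)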
Answer 0 (score: 1)
In vanilla pyspark, map the data to a pair RDD keyed on the state, with a (city, population) pair as the value, then reduceByKey, keeping the city with the larger population. Note that for cities with the same population it keeps the first one it encounters.
# per state, keep the [city, population] pair with the larger population
rdd.map(lambda reg: (reg[0], [reg[1], reg[2]])) \
   .reduceByKey(lambda v1, v2: v1 if v1[1] >= v2[1] else v2)
With your data the result looks like this:
[('delhi', ['gurgaon', 200000]),
('west bengal', ['kolkata', 150000]),
('karnataka', ['bangalore', 200000])]
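If the input is still the raw text file from the question, the lines have to be split and the population cast to an integer before that map/reduce. A minimal end-to-end sketch under those assumptions (the file name cities.txt is made up):

from pyspark import SparkContext

sc = SparkContext(appName="largest-city-per-state")

result = (
    sc.textFile("cities.txt")                             # hypothetical input path
      .map(lambda line: line.split(","))                  # -> [state, city, population]
      .map(lambda reg: (reg[0], [reg[1], int(reg[2])]))   # key by state, cast population to int
      .reduceByKey(lambda v1, v2: v1 if v1[1] >= v2[1] else v2)
)

for state, (city, population) in result.collect():
    print("%s,%s,%d" % (state, city, population))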
Answer 1 (score: 0)
This should do the trick:
>>> sc = spark.sparkContext
>>> rdd = sc.parallelize([
['west bengal','kolkata',150000],
['karnataka','bangalore',200000],
['karnataka','mangalore',80000],
['west bengal','bongaon',50000],
['delhi','new delhi',100000],
['delhi','gurgaon',200000],
])
>>> df = rdd.toDF(['state','city','population'])
>>> df.show()
+-----------+---------+----------+
| state| city|population|
+-----------+---------+----------+
|west bengal| kolkata| 150000|
| karnataka|bangalore| 200000|
| karnataka|mangalore| 80000|
|west bengal| bongaon| 50000|
| delhi|new delhi| 100000|
| delhi| gurgaon| 200000|
+-----------+---------+----------+
>>> df.groupBy('city').max('population').show()
+---------+---------------+
| city|max(population)|
+---------+---------------+
|bangalore| 200000|
| kolkata| 150000|
| gurgaon| 200000|
|mangalore| 80000|
|new delhi| 100000|
| bongaon| 50000|
+---------+---------------+
>>> df.groupBy('state').max('population').show()
+-----------+---------------+
| state|max(population)|
+-----------+---------------+
| delhi| 200000|
|west bengal| 150000|
| karnataka| 200000|
+-----------+---------------+
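Note that groupBy('state').max('population') keeps only the state and the maximum population, so the city name is dropped, which is exactly the part the question found hard. One way to get the three-column output (a sketch, not part of this answer) is to join that aggregate back onto the original DataFrame:

>>> from pyspark.sql import functions as F
>>> max_pop = df.groupBy('state').agg(F.max('population').alias('population'))
>>> df.join(max_pop, on=['state', 'population']).select('state', 'city', 'population').show()

If two cities in a state tie for the maximum population, this keeps both rows.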