I am new to Spark and Python, trying some basic things like printing the count and maximum of employee data.
from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as psf
spark = SparkSession \
    .builder \
    .appName("Hello") \
    .getOrCreate()
# note: .config() takes a key and a value; a bare string like .config("World") is not a valid option
sc = spark.sparkContext
sqlContext = SQLContext(sc)
df = spark.createDataFrame(
    sc.textFile("employee.txt").map(lambda l: l.split('::')),
    ["employeeid", "deptid", "salary"]
)
df.createOrReplaceTempView("df")  # registerTempTable is deprecated since Spark 2.0
mostEmpDept = sqlContext.sql("""select deptid, cntDept from (
select deptid, count(*) as cntDept, max(count(*)) over () as maxcnt
from df
group by deptid) as tmp
where tmp.cntDept = tmp.maxcnt""")
mostEmpDept.show()
The code above gives me the deptid with the most employees, as shown below:
+-------+--------+
|deptid |cntDept |
+-------+--------+
| 10 | 7|
+-------+--------+
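For intuition, the windowed count/max logic in that SQL query can be mimicked in plain Python (a minimal sketch with made-up sample rows, not the Spark code itself):

```python
from collections import Counter

# each record is (employeeid, deptid, salary), as in employee.txt
rows = [
    ("1", "10", "5000"), ("2", "10", "6000"), ("3", "20", "4000"),
    ("4", "10", "7000"), ("5", "30", "4500"),
]

# group by deptid and count -- the inner SELECT
cnt = Counter(deptid for _, deptid, _ in rows)

# max(count(*)) over () -- the maximum count across all groups
maxcnt = max(cnt.values())

# keep only departments whose count equals the maximum -- the outer WHERE
most = {d: c for d, c in cnt.items() if c == maxcnt}
print(most)  # {'10': 3}
```

The window function lets the query compare each group's count against the global maximum in a single pass, which is what makes the outer filter possible.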
Now I have another file containing all deptids and their names. How can I map this result against that file and print the name for deptid 10? The other file looks like this:
10::Marketing
20::Finance
30::HumanResource
40::HouseKeeping
Answer 0 (score: 2)
Use the following:
sc = spark.sparkContext
sqlContext = SQLContext(sc)
df = spark.createDataFrame(
    sc.textFile("employee.txt").map(lambda l: l.split('::')),
    ["employeeid", "deptid", "salary"]
)
df.createOrReplaceTempView("df")
dept = spark.createDataFrame(
    sc.textFile("dept.txt").map(lambda l: l.split('::')),
    ["deptid", "deptname"]
)
dept.createOrReplaceTempView("dept")
mostEmpDept = sqlContext.sql("""select deptid, cntDept from (
select deptid, count(*) as cntDept, max(count(*)) over () as maxcnt
from df
group by deptid) as tmp
where tmp.cntDept = tmp.maxcnt""")
mostEmpDept.createOrReplaceTempView('mostEmpDept')
final_df= sqlContext.sql("select a.deptid, b.deptname from mostEmpDept a inner join dept b on a.deptid=b.deptid")
final_df.show()
If you want to save the result, note that saveAsTextFile is an RDD method, not a DataFrame method; write the DataFrame out instead:
final_df.write.csv('Location')
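The join itself boils down to looking up the winning deptid in the dept file. Stripped of Spark, the mapping works like this (a plain-Python illustration of the lookup, assuming the deptid::deptname format shown above):

```python
# dept.txt lines in the deptid::deptname format shown in the question
dept_lines = ["10::Marketing", "20::Finance", "30::HumanResource", "40::HouseKeeping"]

# build a deptid -> deptname lookup, mirroring the inner join on deptid
dept_map = dict(line.split("::") for line in dept_lines)

print(dept_map["10"])  # Marketing
```

The SQL inner join does the same pairing, but distributed across the cluster and without materializing a dictionary on the driver.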