PySpark - average number of days by year and month

Time: 2018-06-12 18:28:13

Tags: pyspark apache-spark-sql hdfs rdd parquet

I have a CSV file stored in HDFS with the following format:

Business Line,Requisition (Job Title),Year,Month,Actual (# of Days)
Communications,1012_Com_Specialist,2017,February,150
Information Technology,5781_Programmer_Associate,2017,March,80
Information Technology,2497_Programmer_Senior,2017,March,120
Services,6871_Business_Analyst_Jr,2018,May,33

I want to compute the average of Actual (# of Days) grouped by Year and Month. Could someone please help me do this with PySpark and save the output to a Parquet file?

1 Answer:

Answer 0: (score: 0)

You can load the CSV into a DataFrame and run Spark SQL over it, as shown below:
