I have a CSV file stored in HDFS, formatted as follows:
Business Line,Requisition (Job Title),Year,Month,Actual (# of Days)
Communications,1012_Com_Specialist,2017,February,150
Information Technology,5781_Programmer_Associate,2017,March,80
Information Technology,2497_Programmer_Senior,2017,March,120
Services,6871_Business_Analyst_Jr,2018,May,33
I want to get the average of Actual (# of Days) grouped by Year and Month. Could someone please help me do this with PySpark and save the output to a Parquet file?
Answer 0 (score: 0)
You can convert the CSV to a DataFrame and run Spark SQL on it, as shown below: