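My requirement is to read a table from Hive (about 1 TB in size) and run a large number of aggregations on it, mostly avg and sum. I tried the code below, but it runs for a very long time. Is there another way to optimize this, or a more efficient way to handle many agg operations?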
finalDF.groupBy($"Dseq", $"FmNum", $"yrs",$"mnt",$"FromDnsty")
.agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"),avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),
avg($"neg"),avg($"Rd"),avg("savg"),avg("slavg"),avg($"dex"),avg("cur"),avg($"Nexp"), avg($"NExpp"),avg($"Psat"),
avg($"Pexps"),avg($"Pxn"),avg($"Pn"),avg($"AP3"),avg($"APd"),avg($"RInd"),avg($"CP"),avg($"CScr"),
avg($"Fspct7p1"), avg($"Fspts7p1"),avg($"TlpScore"),avg($"Ordrs"),avg($"Drs"),
avg("Lns"),avg("Judg"),avg("ds"),
avg("ob"),sum("Ss"),sum("dol"),sum("liens"),sum("pct"),
sum("jud"),sum("sljd"),sum("pNB"),avg("pctt"),sum($"Dolneg"),sum("Ls"),sum("sl"),sum($"PA"),sum($"DS"),
sum($"DA"),sum("dcur"),sum($"sat"),sum($"Pes"),sum($"Pn"),sum($"Pn"),sum($"Dlo"),sum($"Dol"),sum("pdol"),sum("pct"),sum("judg"))
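For readability, the same kind of aggregation could be built from lists of column names instead of being written out by hand. The following is only a minimal sketch: `aggregateAll`, the `avg_`/`sum_` output names and the `cnt` alias are hypothetical, and the column lists in the usage line are shortened. It also drops repeated column names, such as the duplicated `sum($"Pn")` and `sum("pct")` above. This is not expected to change the runtime by itself, since Spark already evaluates all expressions of a single `agg` call in the same grouping pass.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{avg, col, count, sum}

// Hypothetical helper: groups by `keys`, then computes a row count,
// an average for each column in `avgCols`, and a sum for each column in `sumCols`.
def aggregateAll(
    df: DataFrame,
    keys: Seq[String],
    avgCols: Seq[String],
    sumCols: Seq[String]): DataFrame = {
  // Build one aliased expression per column; distinct removes repeated names.
  val aggs: Seq[Column] =
    avgCols.distinct.map(c => avg(col(c)).as(s"avg_$c")) ++
      sumCols.distinct.map(c => sum(col(c)).as(s"sum_$c"))

  df.groupBy(keys.map(c => col(c)): _*)
    .agg(count(col(keys.head)).as("cnt"), aggs: _*)
}

// Usage with shortened, illustrative column lists (finalDF is the DataFrame from the question):
// val result = aggregateAll(
//   finalDF,
//   Seq("Dseq", "FmNum", "yrs", "mnt", "FromDnsty"),
//   Seq("Emp", "Ntw", "Age"),
//   Seq("Ss", "dol", "liens"))
```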
Note: I am using Spark with Scala.