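I have a DataFrame with 13 categorical string columns. I want to count the distinct values in each column using the code below. The file is 68 GB and has 68,951,428 rows. Unfortunately, the job is too slow (it runs for more than 2 hours). How can I fix this?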

 from pyspark.sql.functions import approx_count_distinct, col
 distvals = clickDF.agg(*(approx_count_distinct(col(c)).alias(c) for c in clickDF.columns if c.startswith("c")))
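For reference, here is a minimal sketch of the same aggregation, split into a plain-Python column filter and a PySpark step that is imported lazily. The helper names `categorical_columns` and `distinct_count_exprs` are hypothetical, and so is the `rsd=0.1` value. `rsd` is the maximum relative standard deviation that `approx_count_distinct` accepts (default 0.05); whether loosening it actually shortens this job is an untested assumption.

```python
def categorical_columns(columns, prefix="c"):
    """Return the column names that start with the given prefix."""
    return [c for c in columns if c.startswith(prefix)]


def distinct_count_exprs(columns, rsd=0.1):
    """Build one approx_count_distinct expression per selected column.

    PySpark is imported here so the filter above can be used without it.
    rsd=0.1 is an illustrative value, not a recommendation.
    """
    from pyspark.sql.functions import approx_count_distinct, col
    return [approx_count_distinct(col(c), rsd=rsd).alias(c)
            for c in categorical_columns(columns)]


# Hypothetical usage with the DataFrame from the question:
# distvals = clickDF.agg(*distinct_count_exprs(clickDF.columns))
```

Note that the prefix filter also picks up any other column whose name begins with "c", not only the 13 categorical ones, so it may be worth checking what it returns for `clickDF.columns`.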

My spark-submit command is as follows: spark-submit --master yarn --deploy-mode client --num-executors 18 --executor-cores 10 --executor-memory 19g --conf spark.yarn.executor.memoryOverhead=5120 click_spark_cluster.py
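To put the numbers above side by side, here is a rough back-of-the-envelope calculation in plain Python. The executor settings, file size, and row count come from the question; the 128 MB input split size is an assumption (the real value depends on the cluster and file format), and "GB" is treated as 1024^3 bytes.

```python
# Values taken from the spark-submit command.
num_executors = 18
executor_cores = 10
executor_memory_gb = 19
memory_overhead_mb = 5120

# Values taken from the question text.
file_size_gb = 68
row_count = 68_951_428

# Assumption: 128 MB input splits; not confirmed for this cluster.
ASSUMED_SPLIT_MB = 128

# Number of tasks that can run at the same time.
task_slots = num_executors * executor_cores

# Memory requested from YARN per executor container, and in total.
container_gb = executor_memory_gb + memory_overhead_mb / 1024
total_container_gb = num_executors * container_gb

# Estimated number of input splits (ceiling division).
input_splits = -(-file_size_gb * 1024 // ASSUMED_SPLIT_MB)

# Average size of one row in bytes.
avg_row_bytes = file_size_gb * 1024**3 / row_count

print("concurrent task slots:", task_slots)
print("memory per container (GB):", container_gb)
print("total container memory (GB):", total_container_gb)
print("estimated input splits:", input_splits)
print("average row size (bytes):", round(avg_row_bytes))
```

Under these assumptions there would be about three times as many input splits as task slots, so the scan stage would need several waves of tasks.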

approx_count_distinct in Spark is too slow

0 answers: