Question

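I have a huge dataframe with 1340 columns. Before diving into modeling, I need to get rid of the columns that carry no obvious value. The few ways I have found to do this all require running actions on the dataframe, which takes a very long time (about 75 hours). How can I solve this using only transformations, to save a significant amount of time?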

I am using Azure Databricks running Apache Spark 2.4.0 and Python 3.5.
Cluster specs:
- Workers: 56 GB memory, 16 cores
- Driver: 56 GB memory, 16 cores
2-8 nodes (autoscaling)

from pyspark.sql.functions import countDistinct

# Do not run this as-is: it launches one Spark job per column

cols_to_drop = []

for c in df.columns:
  # Extract the value computed by countDistinct().
  # collect() is the time-consuming part because it is an action,
  # and it is triggered once for every column.
  if df.agg(countDistinct(c)).collect()[0][0] < 2:
    print("{} has no distinct values.".format(c))
    cols_to_drop.append(c)

print(len(cols_to_drop))
df = df.drop(*cols_to_drop)

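I also tried approx_count_distinct, which is supposed to be faster when the estimation error is > 0.01. It did not change much, though, and it often takes even longer.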

I want to do the same thing (drop the columns that have no distinct values) without any function that implies an action, such as collect().

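Edit:
This is not recommended for large datasets, but I did it anyway: I converted my dataframe to a pandas dataframe with toPandas(). It took 10 minutes, which is quite good. After that it is simply: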

cols_to_drop = [c for c in df.columns if len(df[c].unique()) < 2]
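A vectorised variant of the pandas line above might look like the sketch below. The toy frame and the name pdf are hypothetical stand-ins for the result of toPandas(). Note one difference in semantics: nunique() ignores missing values by default, which matches Spark's countDistinct, whereas unique() counts NaN as a value; passing dropna=False to nunique() would mimic unique().

```python
import pandas as pd

# Hypothetical toy frame standing in for the result of df.toPandas().
pdf = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "const_str": ["a", "a", "a"],
        "all_null": [None, None, None],
    }
)

# Distinct non-null values per column, computed in one call.
distinct_counts = pdf.nunique()

# Keep only the names of columns with fewer than two distinct values.
cols_to_drop = distinct_counts[distinct_counts < 2].index.tolist()
pdf = pdf.drop(columns=cols_to_drop)
print(cols_to_drop)
```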

PySpark: dropping columns with no distinct values using only transformations

0 answers: