Question

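I am reviewing some development code and need to either avoid the 'withColumn' function or find another way to add columns to a DataFrame, but I have the following questions: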

Does chaining 'withColumn' calls create new tables (as in the code below)? If I use 6 'withColumn' calls, are 6 new tables created in memory?

val newDataframe = table
  .withColumn("name", col("consolidate").cast(DecimalType(17,2)))
  .withColumn("name", col("consolidate").cast(DecimalType(17,2)))

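If using many 'withColumn' calls really does increase memory usage and reduce performance, how can I avoid 'withColumn' when adding columns to a DataFrame and still get the same result?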
Is there a way to use less memory and run faster without 'withColumn' while getting the same result, that is, a DataFrame with 6 added columns?

I don't know how to do this.

The code to optimize is the following:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

def myMethod(table: DataFrame): DataFrame = {
    // Each withColumn adds one column cast to DECIMAL(17,2)
    val newDataframe = table
      .withColumn("name", col("consolidate").cast(DecimalType(17,2)))
      .withColumn("id_value", col("east").cast(DecimalType(17,2)))
      .withColumn("x_value", col("daily").cast(DecimalType(17,2)))
      .withColumn("amount", col("paid").cast(DecimalType(17,2)))
      .withColumn("client", col("lima").cast(DecimalType(17,2)))
      .withColumn("capital", col("econo").cast(DecimalType(17,2)))
    newDataframe
  }

Answer 1

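There is a misunderstanding here: Spark does not create 6 intermediate datasets in memory. In fact, your function triggers no memory changes at all, because Spark transformations (such as withColumn) are evaluated lazily, only once an action (such as .show() or .count()) is called.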
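To see the laziness for yourself, here is a minimal sketch. The local SparkSession, the sample row, and the names LazyEvaluationSketch and addTwoColumns are assumptions made for illustration; only the column names and the cast come from the question. Chaining the withColumn calls only builds a plan, explain prints that plan without running a job, and show is the action that actually executes it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

object LazyEvaluationSketch {
  // Transformations only: this returns a new plan, no data is processed here.
  def addTwoColumns(table: DataFrame): DataFrame =
    table
      .withColumn("name", col("consolidate").cast(DecimalType(17, 2)))
      .withColumn("id_value", col("east").cast(DecimalType(17, 2)))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lazy-evaluation-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data using two source column names from the question.
    val table = Seq(("1.234", "5.678")).toDF("consolidate", "east")

    val newDataframe = addTwoColumns(table)

    // Prints the logical and physical plans; no job is run for this.
    newDataframe.explain(true)

    // An action: only now does Spark execute the plan.
    newDataframe.show()

    spark.stop()
  }
}
```

In the optimized plan printed by explain you would typically expect the chained projections to appear merged into one, although the exact output depends on your Spark version.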

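When the action is called, Spark optimizes your transformations and performs them all at once, so calling .withColumn 6 times is not a problem memory-wise.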
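If you still want to avoid the chain, one common alternative is a single select that keeps every existing column and appends the casted ones. The sketch below assumes the six target names do not already exist in the table (withColumn would replace a column with the same name, whereas select would produce a duplicate); the names SelectSketch, casts, and myMethodWithSelect are hypothetical. The Spark API documentation for withColumn itself recommends a single select when many columns are added, because each withColumn call introduces its own projection into the plan.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

object SelectSketch {
  // Target column name -> source column name, taken from the question's code.
  val casts: Seq[(String, String)] = Seq(
    "name"     -> "consolidate",
    "id_value" -> "east",
    "x_value"  -> "daily",
    "amount"   -> "paid",
    "client"   -> "lima",
    "capital"  -> "econo"
  )

  def myMethodWithSelect(table: DataFrame): DataFrame = {
    // Build the six casted columns, each aliased to its target name.
    val newColumns: Seq[Column] = casts.map { case (target, source) =>
      col(source).cast(DecimalType(17, 2)).as(target)
    }
    // One projection: all existing columns followed by the new ones.
    table.select(col("*") +: newColumns: _*)
  }
}
```

For 6 columns the difference from the original chain should be negligible; this form mainly pays off when the number of added columns is large or computed in a loop.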

Alternatives to using the 'withColumn' function in Spark Scala
