Question

I have a wide DataFrame (130,000 rows x 8,700 columns), and when I try to sum all the columns I get the following error:

Exception in thread "main" java.lang.StackOverflowError
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
    at scala.collection.generic.GenericCompanion.apply(GenericCompanion.scala:49)
    at org.apache.spark.sql.catalyst.expressions.BinaryExpression.children(Expression.scala:400)
    at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild$lzycompute(TreeNode.scala:88)
    ...

This is my Scala code:

  val df = spark.read
    .option("header", "false")
    .option("delimiter", "\t")
    .option("inferSchema", "true")
    .csv("D:\\Documents\\Trabajo\\Fábregas\\matrizLuna\\matrizRelativa")


  val arrayList = df.drop("cups").columns
  var colsList = List[Column]()
  arrayList.foreach { c => colsList :+= col(c) }

  val df_suma = df.withColumn("consumo_total", colsList.reduce(_ + _))

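If I do the same with only a few columns it works fine, but whenever I try the reduce operation with a large number of columns I get the same error.

Can anyone suggest what I should do? Is there a limit on the number of columns?

Thanks!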

Answer 1

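You can use a different reduction method that produces a balanced binary tree of depth O(log(n)), instead of a degenerate, linearized chain of BinaryExpressions of depth O(n):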

def balancedReduce[X](list: List[X])(op: (X, X) => X): X = list match {
  case Nil => throw new IllegalArgumentException("Cannot reduce empty list")
  case List(x) => x
  case xs => {
    val n = xs.size
    val (as, bs) = list.splitAt(n / 2)
    op(balancedReduce(as)(op), balancedReduce(bs)(op))
  }
}

Now, in your code, you can replace

colsList.reduce(_ + _)

with

balancedReduce(colsList)(_ + _)

Here is a small example that further illustrates what happens with the BinaryExpressions. It compiles without any dependencies:

sealed trait FormalExpr
case class BinOp(left: FormalExpr, right: FormalExpr) extends FormalExpr {
  override def toString: String = {
    val lStr = left.toString.split("\n").map("  " + _).mkString("\n")
    val rStr = right.toString.split("\n").map("  " + _).mkString("\n")
    return s"BinOp(\n${lStr}\n${rStr}\n)"
  }
}
case object Leaf extends FormalExpr

val leafs = List.fill[FormalExpr](16){Leaf}

println(leafs.reduce(BinOp(_, _)))
println(balancedReduce(leafs)(BinOp(_, _)))

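The ordinary reduce (which is essentially what happens in your code) produces a left-leaning chain: each BinOp is nested inside the left child of the next one, so 16 leaves give 15 levels of nesting.

This is what balancedReduce produces: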

BinOp(
  BinOp(
    BinOp(
      BinOp(
        Leaf
        Leaf
      )
      BinOp(
        Leaf
        Leaf
      )
    )
    BinOp(
      BinOp(
        Leaf
        Leaf
      )
      BinOp(
        Leaf
        Leaf
      )
    )
  )
  BinOp(
    BinOp(
      BinOp(
        Leaf
        Leaf
      )
      BinOp(
        Leaf
        Leaf
      )
    )
    BinOp(
      BinOp(
        Leaf
        Leaf
      )
      BinOp(
        Leaf
        Leaf
      )
    )
  )
)

The linearized chain has depth O(n), and when Catalyst walks it recursively, it blows the stack. This should not happen with a balanced tree of depth O(log(n)).
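To put numbers on the difference, here is a minimal sketch that measures the nesting depth of both trees for 8,700 leaves, the column count from the question. The `Expr`, `Leaf`, `BinOp` and `depth` names are illustrative stand-ins and not Spark classes. The depth is computed with an explicit work list, so the measurement itself does not recurse down the long chain.

```scala
// Dependency-free stand-ins for Catalyst expressions (illustrative only).
sealed trait Expr
case object Leaf extends Expr
final case class BinOp(left: Expr, right: Expr) extends Expr

// Same balanced reduction as in the answer, repeated so the sketch stands alone.
def balancedReduce[X](list: List[X])(op: (X, X) => X): X = list match {
  case Nil => throw new IllegalArgumentException("Cannot reduce empty list")
  case List(x) => x
  case xs =>
    val (as, bs) = xs.splitAt(xs.size / 2)
    op(balancedReduce(as)(op), balancedReduce(bs)(op))
}

// Number of BinOp nodes on the longest root-to-leaf path,
// computed iteratively to avoid deep recursion.
def depth(root: Expr): Int = {
  var maxDepth = 0
  var pending: List[(Expr, Int)] = List((root, 0))
  while (pending.nonEmpty) {
    val (node, d) = pending.head
    pending = pending.tail
    node match {
      case Leaf        => if (d > maxDepth) maxDepth = d
      case BinOp(l, r) => pending = (l, d + 1) :: (r, d + 1) :: pending
    }
  }
  maxDepth
}

val leaves = List.fill[Expr](8700)(Leaf)
val linearDepth = depth(leaves.reduce[Expr](BinOp(_, _)))
val balancedDepth = depth(balancedReduce(leaves)(BinOp(_, _)))

println(s"linear: $linearDepth, balanced: $balancedDepth")
// prints "linear: 8699, balanced: 14"
```

Any recursive traversal needs roughly one stack frame per level, so the plain reduce asks for thousands of nested calls where the balanced version needs about a dozen.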

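While we are on the subject of asymptotic running time: why append to a mutable colsList? Each :+= on a List copies the whole list, so building it this way takes O(n^2) time. Why not simply call toList on the result of .columns?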
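Putting both suggestions together, the question's code could look roughly like the sketch below. It assumes Spark SQL is on the classpath. `withRowTotal` and its parameter names are hypothetical helpers introduced here for illustration, not part of Spark.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Same balanced reduction as in the answer, repeated so the sketch stands alone.
def balancedReduce[X](list: List[X])(op: (X, X) => X): X = list match {
  case Nil => throw new IllegalArgumentException("Cannot reduce empty list")
  case List(x) => x
  case xs =>
    val (as, bs) = xs.splitAt(xs.size / 2)
    op(balancedReduce(as)(op), balancedReduce(bs)(op))
}

// Hypothetical helper: appends a column holding the row-wise sum of
// every column except `idColumn`.
def withRowTotal(df: DataFrame, idColumn: String, totalName: String): DataFrame = {
  // Build the column list in one pass instead of appending in a loop.
  val cols: List[Column] = df.drop(idColumn).columns.toList.map(c => col(c))
  // The balanced reduction keeps the expression tree shallow.
  df.withColumn(totalName, balancedReduce(cols)(_ + _))
}
```

With the DataFrame from the question, the call would be `val df_suma = withRowTotal(df, "cups", "consumo_total")`.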
