Question

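I am reading files from S3 into a DataFrame, limiting the records to 100. I then add about 10 columns to this DataFrame.

I can see the new schema, which shows that the columns have been added:

But when I perform any action on the final DataFrame (existingDF in this case), the Spark job does not run.

In the Spark UI, all I see is: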

collect at <console>:52

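which is the job that reads the files from S3.

The job itself does not terminate and keeps running, but the Spark UI does not show anything running.

I checked the logs, which show all executors being removed one after another, and the Spark UI shows all executors as dead except for the driver.

Another strange thing is that in the Spark UI the driver shows 0 cores, while each executor has 4 cores. I have configured both the driver and the executors to have 4 cores.

I tried using val instead of var, explicitly creating a new DataFrame for each operation, but the result was the same.

I also tried persisting the DataFrame on disk after reading the data from S3.

Here is the main code: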

import org.apache.spark.sql.{DataFrame, _}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DataType, StructType}
import org.apache.spark.storage.StorageLevel

import scala.collection.mutable.ListBuffer
import scala.util.Try

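// Returns true if the (possibly nested) column path can be resolved against df.
def columnExists(df: DataFrame, path: String): Boolean = Try(df(path)).isSuccess

// Read the Parquet files from S3 with the patched schema, keeping at most 100 records.
var existingDF = sqlContext.read
  .format("parquet")
  .option("basePath", s"$basePath")
  .schema(patchedSchema)
  .load(filePaths: _*)
  .limit(100)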

if (!columnExists(existingDF, "column_one.abc")) {
      existingDF = existingDF.withColumn(
        "column_one",
        struct(
          $"column_one.*",
          struct(
            lit(null).cast("long").as("first_val"),
            lit(null).cast("string").as("second_val")
          ).as("abc")
        )
      )

      println("Column added")
}

if (!columnExists(existingDF, "column_one.def")) {
      existingDF = existingDF.withColumn(
        "column_one",
        struct(
          $"column_one.*",
          struct(
            lit(null).cast("long").as("first_val"),
            lit(null).cast("string").as("second_val")
          ).as("def")
        )
      )

      println("Column added")
}

if (!columnExists(existingDF, "column_one.ghi")) {
      existingDF = existingDF.withColumn(
        "column_one",
        struct(
          $"column_one.*",
          struct(
            lit(null).cast("long").as("first_val"),
            lit(null).cast("string").as("second_val")
          ).as("ghi")
        )
      )

      println("Column added")
}

Even when I run

existingDF.explain

it does not work.

If I add only 1 or 2 columns, everything works, but as soon as I add the 3rd column, the problem starts.
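For comparison, here is a minimal sketch of adding all the missing nested structs in a single pass, so that column_one is rebuilt once instead of once per new field. The helper name addMissingStructs is hypothetical and not part of the original code. It assumes the parent column is a struct whose field names contain no dots, and it checks the schema directly instead of using Try(df(path)). It is not a confirmed fix for the hang, only a variant that may help narrow down whether the repeated struct($"column_one.*", ...) rewrites are involved.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.types.StructType

// Hypothetical helper (not from the original post): adds every missing
// nested struct under `parent` with a single withColumn call.
// Assumes `parent` is a struct column and its field names contain no dots.
def addMissingStructs(df: DataFrame, parent: String, names: Seq[String]): DataFrame = {
  // Read the existing nested field names from the schema.
  val existing = df.schema(parent).dataType.asInstanceOf[StructType].fieldNames
  val missing = names.filterNot(existing.contains)

  if (missing.isEmpty) df
  else {
    // Keep the existing fields explicitly, in their original order.
    val kept: Seq[Column] = existing.toSeq.map(f => col(s"$parent.$f").as(f))
    // Each new field is a struct of two null-valued, typed columns.
    val added: Seq[Column] = missing.map { name =>
      struct(
        lit(null).cast("long").as("first_val"),
        lit(null).cast("string").as("second_val")
      ).as(name)
    }
    df.withColumn(parent, struct((kept ++ added): _*))
  }
}
```

With this helper, the three if blocks would become a single call such as existingDF = addMissingStructs(existingDF, "column_one", Seq("abc", "def", "ghi")).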

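I have tried this in Zeppelin, which shows the paragraph as running but never finishing, and nothing shows up in the Spark UI.

I also tried running it from the EMR console, but had no luck.

Here are the cluster and Spark configurations: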

Master node (1) - 4 vCore, 8 GiB memory, EBS only storage. 32 GiB

Core nodes (5) - 16 vCore, 32 GiB memory, EBS only storage. 64 GiB


spark.submit.deployMode=cluster 
spark.executor.memory=20g
--num-executors=4 
--driver-memory=20g
spark.driver.cores=4
spark.executor.cores=4

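I assume this is not a memory issue, since I only have 100 records and there is enough memory to hold them.

Let me know if I need to add any more details.

Spark job stalls when adding new columns to a DataFrame

0 answers: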