Invalid status code "400" from .. error payload: "requirement failed: Session isn't active"

Time: 2019-06-20 18:30:00

Tags: apache-spark pyspark

I am running a PySpark script in a Jupyter notebook that writes a DataFrame to CSV, like this:

df.coalesce(1).write.csv('Data1.csv', header='true')

After running for about an hour, it fails with the following error.

Error: Invalid status code from http://.....session isn't active.

My configuration is as follows:

spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("shuffle.service.enabled","true")
spark.conf.set("spark.dynamicAllocation.minExecutors",6)
spark.conf.set("spark.executor.heartbeatInterval","3600s")
spark.conf.set("spark.cores.max", "4")
spark.conf.set("spark.sql.tungsten.enabled", "true")
spark.conf.set("spark.eventLog.enabled", "true")
spark.conf.set("spark.app.id", "Logs")
spark.conf.set("spark.io.compression.codec", "snappy")
spark.conf.set("spark.rdd.compress", "true")
spark.conf.set("spark.executor.instances", "6")
spark.conf.set("spark.executor.memory", '20g')
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
spark.conf.set("spark.driver.allowMultipleContexts", "true")
spark.conf.set("spark.master", "yarn")
spark.conf.set("spark.driver.memory", "20G")
spark.conf.set("spark.executor.instances", "32")
spark.conf.set("spark.executor.memory", "32G")
spark.conf.set("spark.driver.maxResultSize", "40G")
spark.conf.set("spark.executor.cores", "5")

I checked the container node and found this error:

ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed:container_e836_1556653519610_3661867_01_000005 on host: ylpd1205.kmdc.att.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143

I cannot figure out what the problem is.

2 Answers:

Answer 0 (score: 1)

Judging from the output, and assuming your application is not finishing with a FAILED status, this sounds like a Livy timeout error: your application likely takes longer than the timeout defined for the Livy session (which defaults to 1h). So even though the Spark application itself succeeds, your notebook still receives this error if the job runs longer than the Livy session's timeout.

If that is the case, here is how to address it:

  1. Edit the /etc/livy/conf/livy.conf file (on the cluster's master node)
  2. Set livy.server.session.timeout to a higher value, e.g. 8h (or larger, depending on your application); a sketch of the change follows this list
  3. Restart Livy so the updated setting takes effect: run sudo restart livy-server on the cluster master
  4. Test your code again
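
As an illustration, and assuming a standard Livy layout where the configuration lives at /etc/livy/conf/livy.conf, the change would look roughly like this (8h is only an example; pick a value longer than your job's expected runtime):

livy.server.session.timeout = 8h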

Answer 1 (score: 0)

I don't know PySpark very well, but in Scala the solution would involve something like the following.

First, we need a method that creates a header file. It would look like this:

// Imports needed by the snippets below
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

def createHeaderFile(headerFilePath: String, colNames: Array[String]): Unit = {

  // build the full header file path
  val fileName = "dfheader.csv"
  val headerFileFullName = "%s/%s".format(headerFilePath, fileName)

  // write the header to HDFS as a single comma-separated line
  val hadoopConfig = new Configuration()
  val fileSystem = FileSystem.get(hadoopConfig)
  val output = fileSystem.create(new Path(headerFileFullName))
  val writer = new PrintWriter(output)

  writer.write(colNames.mkString(","))
  writer.write("\n")
  writer.close()
}

You will also need a method that uses the Hadoop API to merge the part files written by df.write, which would look like this:

def mergeOutputFiles(sourcePaths: String, destLocation: String): Unit = {

  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)

  // If you have an Array[String] of source paths, wrap the block below in
  // `for (sourcePath <- sourcePaths) { ... }` and use sourcePath instead.

  // Get the path under the destination where the merged file should land
  val pathText = sourcePaths.split("/")
  val destPath = "%s/%s".format(destLocation, pathText.last)

  // Merge the part files into a single file (deleteSource = true removes the part files)
  FileUtil.copyMerge(hdfs, new Path(sourcePaths), hdfs, new Path(destPath), true, hadoopConfig, null)

  // Delete the temporary partitioned output folder once the merge is complete
  val tempFilesPath = new Path(sourcePaths).getParent
  hdfs.delete(tempFilesPath, true)
}

And here is the method that generates the output files, i.e. the df.write step: you pass it the large DataFrame that you want written out to HDFS.

def generateOutputFiles(processedDf: DataFrame, opPath: String, tempOutputFolder: String,
                        spark: SparkSession): String = {

  // temporary output directory, e.g. <opPath><tempOutputFolder>NameofyourCsvFile.csv
  val fileName = "%s%sNameofyourCsvFile.csv".format(opPath, tempOutputFolder)

  // write the DataFrame as CSV part files into the temp directory and create the header file
  processedDf.write.mode("overwrite").csv(fileName)
  createHeaderFile(fileName, processedDf.columns)

  // Return the partitioned file path so it can be passed on for merging.
  // If the output needs to be split into multiple files based on some parameter,
  // change the return type to Array[String], collect one fileName per output in a loop,
  // and return the whole array instead.
  val outputFilePathList = fileName
  outputFilePathList
}

With all of the methods above defined, here is how you put them to use. Assume your transformation logic lives in a method like this:

def processyourlogic(/* your parameters, if any */): DataFrame = {
  // ... your logic to do whatever needs to be done to your data
}

Assuming the method above returns a DataFrame, here is how you put everything together:

val yourBigDf = processyourlogic(/* your parameters */) // returns a DataFrame
yourBigDf.cache() // cache it in case you need it again
val outputPathFinal = "location where you want your file to be saved"
val tempOutputFolderLocation = "temp/"
val partFiles = generateOutputFiles(yourBigDf, outputPathFinal, tempOutputFolderLocation, spark)
mergeOutputFiles(partFiles, outputPathFinal)

Let me know if you have any other questions related to this; ideally those should be addressed separately. If the answer you are looking for differs from the original question, please ask a new question.