I am reading a fixed-width text file and need to convert it to CSV. My program runs fine on my local machine, but when I run it on a cluster it throws a "Task not serializable" exception.
I have tried to solve the same problem with both map and mapPartitions.
Using toLocalIterator on the RDD works correctly, but that does not scale to large files (mine is 8 GB).
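For reference, the working toLocalIterator version looked roughly like the sketch below (simplified, not my exact code; it uses the same fields as the mapPartitions code further down). Because toLocalIterator pulls each partition back to the driver, all of the parsing happens on one machine, which is why it works but cannot keep up with an 8 GB file:

// Simplified sketch of the toLocalIterator approach: everything runs on the driver,
// so no user closure has to be shipped to the executors, but the driver does all the work.
val localRows = inpData.toLocalIterator
  .filter(line => line.length() >= maxIndex)
  .map(line => getRow(line, indexList))
localRows.foreach(println)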
Below is the code using mapPartitions, which is my most recent attempt:
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Row

// Read the source file and create the RDD
def main() {
  var inpData = sc.textFile(s3File)
  LOG.info(s"\n inpData >>>>>>>>>>>>>>> [${inpData.count()}]")

  val rowRDD = inpData.mapPartitions(iter => {
    var listOfRow = new ListBuffer[Row]
    while (iter.hasNext) {
      var line = iter.next()
      if (line.length() >= maxIndex) {
        listOfRow += getRow(line, indexList)
      } else {
        counter += 1
      }
    }
    listOfRow.toIterator
  })

  rowRDD.foreach(println)
}
case class StartEnd(startingPosition: Int, endingPosition: Int) extends Serializable

def getRow(x: String, inst: List[StartEnd]): Row = {
  val columnArray = new Array[String](inst.size)
  for (f <- 0 to inst.size - 1) {
    columnArray(f) = x.substring(inst(f).startingPosition, inst(f).endingPosition)
  }
  Row.fromSeq(columnArray)
}
// Note: for reference, indexList is built from the StartEnd case class; after it is created it looks like this:
[List(StartEnd(0,4), StartEnd(4,10), StartEnd(7,12), StartEnd(10,14))]
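(For completeness, a minimal sketch of how such a list could be constructed and what getRow does with it; the column boundaries are the ones shown above, and the sample line is made up:)

// Hypothetical construction of the column boundaries shown above
val indexList: List[StartEnd] = List(StartEnd(0, 4), StartEnd(4, 10), StartEnd(7, 12), StartEnd(10, 14))

// With maxIndex = 14, a 14-character line is sliced into four (overlapping) columns:
// getRow("ABCD123456EFGH", indexList) == Row("ABCD", "123456", "456EF", "EFGH")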
The program works fine on my local machine, but when I run it on the cluster (AWS) it throws the following exception:
17:24:10.947 [Driver] ERROR bms.edl.dt.transform.FileConversion.convertFixedWidthToCsv - Exception [Task not serializable]
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340) ~[glue-assembly.jar:?]
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330) ~[glue-assembly.jar:?]
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156) ~[glue-assembly.jar:?]
at org.apache.spark.SparkContext.clean(SparkContext.scala:2294) ~[glue-assembly.jar:?]
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:794) ~[glue-assembly.jar:?]
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793) ~[glue-assembly.jar:?]
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) ~[glue-assembly.jar:?]
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) ~[glue-assembly.jar:?]
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) ~[glue-assembly.jar:?]
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:793) ~[glue-assembly.jar:?]
Caused by: java.io.NotSerializableException: sun.nio.cs.UTF_8
Serialization stack:
- object not serializable (class: sun.nio.cs.UTF_8, value: UTF-8)
- field (class: org.apache.logging.log4j.core.layout.AbstractStringLayout, name: charset, type: class java.nio.charset.Charset)
- object (class org.apache.logging.log4j.core.layout.PatternLayout, %d{HH:mm:ss.SSS} [%t] %-5level %logger{5}.%M - %msg%n)
- field (class: org.apache.logging.log4j.core.appender.AbstractAppender, name: layout, type: interface org.apache.logging.log4j.core.Layout)
- object (class org.apache.logging.log4j.core.appender.ConsoleAppender, STDOUT)
- writeObject data (class: java.util.concurrent.ConcurrentHashMap)
- object (class java.util.concurrent.ConcurrentHashMap, {STDOUT=STDOUT})
- field (class: org.apache.logging.log4j.core.config.AbstractConfiguration, name: appenders, type: interface java.util.concurrent.ConcurrentMap)
- object (class org.apache.logging.log4j.core.config.xml.XmlConfiguration, XmlConfiguration[location=jar:file:/mnt/yarn/usercache/root/filecache/163/edl-dt-1.9-SNAPSHOT.jar!/log4j2.xml])
- field (class: org.apache.logging.log4j.core.LoggerContext, name: configuration, type: interface org.apache.logging.log4j.core.config.Configuration)
- object (class org.apache.logging.log4j.core.LoggerContext, org.apache.logging.log4j.core.LoggerContext@418bb61f)
- field (class: org.apache.logging.log4j.core.Logger, name: context, type: class org.apache.logging.log4j.core.LoggerContext)
- object (class org.apache.logging.log4j.core.Logger, com.bms.edl.dt.transform.FileConversion:TRACE in 681842940)
- field (class: com.bms.edl.dt.transform.FileConversion, name: LOG, type: interface org.apache.logging.log4j.Logger)
- object (class com.bms.edl.dt.transform.FileConversion, com.bms.edl.dt.transform.FileConversion@984ddbb)
- field (class: com.bms.edl.dt.transform.FileConversion$$anonfun$7, name: $outer, type: class com.bms.edl.dt.transform.FileConversion)
- object (class com.bms.edl.dt.transform.FileConversion$$anonfun$7, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) ~[glue-assembly.jar:?]
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) ~[glue-assembly.jar:?]
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) ~[glue-assembly.jar:?]
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:337) ~[glue-assembly.jar:?]
... 71 more
17:24:10.954 [Driver] TRACE bms.edl.dt.transform.FileConversion.convertFixedWidthToCsv - Exit
17:24:10.954 [Driver] INFO bms.edl.dt.transform.FileConversion.apply - counterMap>>>>>>>>>Map(ResultantDF -> [], ExceptionString ->
Exception occurred while applying the FileConversion transformation and the exception Message is :Task not serializable)
17:24:11.692 [Driver] INFO bms.edl.dt.transform.FileConversion.apply - df count >>>>>>>>>0
17:24:11.692 [Driver] INFO bms.edl.dt.transform.FileConversion.apply - THERE WAS AN EXCEPTION FIX WIDTHING
17:24:11.692 [Driver] INFO bms.edl.dt.transform.FileConversion.dataTransform - THERE WAS AN EXCEPTION -- sb is not empty
17:24:11.693 [Driver] TRACE bms.edl.dt.transform.FileConversion.dataTransform - Exit
17:24:11.693 [Driver] INFO bms.edl.dt.transform.FileConversion.dataTransform - result>>>>>>>>Map(ResultantDF -> [], ExceptionString ->
Exception occurred while applying the FileConversion transformation and the exception Message is :Task not serializable
Exception occurred while applying the FileConversion transformation and the exception Message is :Task not serializable)
17:24:11.693 [Driver] TRACE edl.core.services.reflection.ReflectionInvoker$.invokeDTMethod - Exit
I am not able to understand what is going wrong here, what exactly is not serializable, and why the exception is thrown.
Any help is appreciated. Thanks in advance!
Answer 0 (score: 0)
You call the getRow method inside mapPartitions. This forces Spark to serialize the instance of your main class and ship it to the workers. The main class contains LOG as a field, and that logger does not appear to be serializable.
You can:
a) move getRow and LOG into a separate object (the general way of solving this kind of problem; see the sketch after this list)
b) make LOG a lazy val
c) use another logging library
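A minimal sketch of options (a) and (b), assuming the enclosing class looks roughly like the code in the question (the FileConversion name is taken from the stack trace; the object name and everything else are illustrative):

import org.apache.logging.log4j.{LogManager, Logger}
import org.apache.spark.sql.Row

// Option (a): keep the parsing logic and the logger in an object. Objects are not
// serialized into the task closure; they are loaded independently on each executor.
object RowParser {
  case class StartEnd(startingPosition: Int, endingPosition: Int)

  def getRow(x: String, inst: List[StartEnd]): Row = {
    val columnArray = new Array[String](inst.size)
    for (f <- inst.indices) {
      columnArray(f) = x.substring(inst(f).startingPosition, inst(f).endingPosition)
    }
    Row.fromSeq(columnArray)
  }
}

class FileConversion extends Serializable {
  // Option (b): a lazy val (here additionally marked @transient) is not serialized
  // with the instance; each executor re-creates the logger on first use.
  @transient lazy val LOG: Logger = LogManager.getLogger(classOf[FileConversion])
}

With getRow in an object, the mapPartitions closure no longer drags in a reference to the FileConversion instance, provided the other values it reads (maxIndex, indexList) are local variables or passed in rather than read from class fields; otherwise the $outer reference is still captured and the logger field still has to be serialized. (Note also that counter += 1 inside mapPartitions only mutates a copy on each executor; a Spark accumulator would be needed to count skipped lines reliably.)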