Question

I have a lot of files that each contain roughly 60,000,000 lines. All of my files use the format {timestamp}#{producer}#{messageId}#{data_bytes}\n
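To make the line format concrete, here is a minimal sketch of splitting one line into its four fields. The names LogLine and LogLineParser are hypothetical, and treating every field as a String (as well as allowing '#' inside data_bytes) is an assumption, since the question does not specify the field types.

```scala
// Hypothetical record type; field names follow the format string in the question.
final case class LogLine(timestamp: String, producer: String, messageId: String, dataBytes: String)

object LogLineParser {
  // Split into at most 4 fields so that any '#' inside data_bytes stays in the payload
  // (assumption: the payload may itself contain '#').
  def parse(line: String): Option[LogLine] =
    line.stripLineEnd.split("#", 4) match {
      case Array(ts, producer, id, data) => Some(LogLine(ts, producer, id, data))
      case _                             => None // fewer than four fields: malformed line
    }
}
```

With this, LogLineParser.parse(line).map(_.producer) would give the grouping key for a line.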

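I walk through my files one by one and want to build one output file per input file. Because some lines depend on previous lines, I group them by their producer. Whenever a line depends on one or more previous lines, those lines always have the same producer. After grouping all the lines, I pass them to my Java parser. The parser then holds all parsed data objects in memory and outputs them as JSON afterwards.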

To visualize how I think my job is processed, I put together the following "flow graph". Note that I did not visualize the groupByKey shuffling process.

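My problems:

I expected Spark to split the files, process the splits with separate tasks, and save each task's output to a "part" file.
However, my tasks run out of memory and get killed by YARN before they can finish: Container killed by YARN for exceeding memory limits. 7.6 GB of 7.5 GB physical memory used
My parser keeps all parsed data objects in memory. I can't change the code of the parser.
Please note that my code works for smaller files (for example, two files with 600,000 lines each as the input to my job).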

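My questions:

How can I make sure that Spark creates a result for every file split in my map task? (Maybe it would if my tasks succeeded, but for now I never get to see any output.)
I thought that my map transformation val lineMap = lines.map ... (see the Scala code below) produces a partitioned RDD. Thus I expected the values of the RDD to be split in some way before the second map task is called.

Furthermore, I thought that calling saveAsTextFile on this RDD lineMap would produce an output task that runs after each of my map tasks has finished. If my assumptions are correct, why do my executors still run out of memory? Is Spark creating several (too) big file splits and processing them concurrently, which causes the parser to fill up the memory?
Is repartitioning the lineMap RDD to get more (smaller) inputs for my parser a good idea?
Is there an additional reducer step somewhere that I am not aware of? For example, results being aggregated before they are written to file?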

Scala code (I left out the irrelevant parts):

def main(args: Array[String]) {
    val inputFilePath = args(0)
    val outputFilePath = args(1)

    val inputFiles = fs.listStatus(new Path(inputFilePath))
    inputFiles.foreach( filename => {
        processData(filename.getPath, ...)
    }) 
}


def processData(filePath: Path, ...) {
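    val lines  = sc.textFile(filePath.toString())
    // Key every line by its producer, then collect all lines of a producer into one group.
    // Note: this splits on " ", while the format described above separates fields with '#'.
    val lineMap = lines.map(line => (line.split(" ")(1), line)).groupByKey()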

    val parsedLines = lineMap.map{ case(key, values) => parseLinesByKey(key, values, config) }
    //each output should be saved separately
    parsedLines.saveAsTextFile(outputFilePath.toString() + "/" + filePath.getName)     
}


def parseLinesByKey(key: String, values: Iterable[String], config : Config) = {
    val importer = new LogFileImporter(...)
    importer.parseData(values.toIterator.asJava, ...)

    //importer from now contains all parsed data objects in memory that could be parsed 
    //from the given values.  

    val jsonMapper = getJsonMapper(...)
    val jsonStringData = jsonMapper.getValueFromString(importer.getDataObject)

    (key, jsonStringData)
}

Answer 1

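I fixed it by removing the groupByKey call and implementing a new FileInputFormat together with a RecordReader, which removes the limitation that lines depend on other lines. For now, I implemented it so that each split also contains the last 50,000 bytes of the previous split as overhead. This ensures that all lines that depend on previous lines can be parsed correctly.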
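As a rough illustration of the overlap idea, the following sketch computes the byte ranges each split would read. OverlappingSplits and its parameters are hypothetical names, and this is not the actual FileInputFormat code from my job, just the arithmetic behind it.

```scala
object OverlappingSplits {
  // Hypothetical helper: byte ranges [readFrom, end) for a file of `fileLength` bytes.
  // Every split except the first also covers the last `overlap` bytes of its predecessor.
  def ranges(fileLength: Long, splitSize: Long, overlap: Long): Seq[(Long, Long)] = {
    require(splitSize > 0 && overlap >= 0, "splitSize must be positive, overlap non-negative")
    (0L until fileLength by splitSize).map { start =>
      val readFrom = math.max(0L, start - overlap)          // reach back into the previous split
      val end      = math.min(fileLength, start + splitSize) // never read past the end of the file
      (readFrom, end)
    }
  }
}
```

In my case the overlap would be 50,000 bytes; the split size is whatever the input format hands out.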

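As a next step, I will still look through the last 50,000 bytes of the previous split, but only copy over the lines that actually affect the parsing of the current split. That way I minimize the overhead and still get a highly parallelizable task.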
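One possible way to select those lines, sketched under the assumption that a line from the overlap can only matter if its producer also occurs in the current split (dependent lines always share a producer). OverlapFilter and relevantOverlap are hypothetical names, and the real criterion may well be stricter.

```scala
object OverlapFilter {
  // Assumes the '#'-separated format from the question, with the producer as the second field.
  private def producerOf(line: String): Option[String] =
    line.split("#", 4) match {
      case Array(_, producer, _*) => Some(producer)
      case _                      => None // line without a producer field
    }

  // Keep only those overlap lines whose producer also appears in the current split.
  def relevantOverlap(overlapLines: Seq[String], splitLines: Seq[String]): Seq[String] = {
    val producersInSplit = splitLines.flatMap(producerOf).toSet
    overlapLines.filter(line => producerOf(line).exists(producersInSplit.contains))
  }
}
```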

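The following links pointed me in the right direction. Because the topic of FileInputFormat/RecordReader is quite complicated at first sight (at least it was for me), it is worth reading through these articles to understand whether this approach suits your problem: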

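Here are the relevant code parts from the ae.be article, in case the website goes down. The author (@Gurdt) uses this to detect whether a chat message contains an escaped line return (the line ends with "\") and appends the escaped lines together until an unescaped \n is found. This allows him to retrieve messages that span two or more lines. The code, written in Scala: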

Usage

val conf = new Configuration(sparkContext.hadoopConfiguration)
val rdd = sparkContext.newAPIHadoopFile("data.txt", classOf[MyFileInputFormat],
classOf[LongWritable], classOf[Text], conf)

FileInputFormat

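import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

class MyFileInputFormat extends FileInputFormat[LongWritable, Text] {
    override def createRecordReader(split: InputSplit, context: TaskAttemptContext):
    RecordReader[LongWritable, Text] = new MyRecordReader()
}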

RecordReader

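import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.FileSplit
import org.apache.hadoop.util.LineReader

class MyRecordReader() extends RecordReader[LongWritable, Text] {
    // pos is the current read position; the methods below refer to it by this name
    var start, end, pos = 0L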
    var reader: LineReader = null
    var key = new LongWritable
    var value = new Text

    override def initialize(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
        // split position in data (start one byte earlier to detect if
        // the split starts in the middle of a previous record)
        val split = inputSplit.asInstanceOf[FileSplit]
        start = math.max(0L, split.getStart - 1)
        end = start + split.getLength

        // open a stream to the data, pointing to the start of the split
        val stream = split.getPath.getFileSystem(context.getConfiguration)
        .open(split.getPath)
        stream.seek(start)
        reader = new LineReader(stream, context.getConfiguration)

        // if the split starts at a newline, we want to start yet another byte
        // earlier to check if the newline was escaped or not
        val firstByte = stream.readByte().toInt
        if(firstByte == '\n')
            start = math.max(0L, start - 1)
        stream.seek(start)

        if(start != 0)
            skipRemainderFromPreviousSplit(reader)
    }

    def skipRemainderFromPreviousSplit(reader: LineReader): Unit = {
        var readAnotherLine = true
        while(readAnotherLine) {
            // read next line
            val buffer = new Text()
            start += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)
            pos = start

            // detect if delimiter was escaped
            readAnotherLine = buffer.getLength >= 1 && // something was read
            buffer.charAt(buffer.getLength - 1) == '\\' && // newline was escaped
            pos <= end // seek head hasn't passed the split
        }
    }

    override def nextKeyValue(): Boolean = {
        key.set(pos)

        // read newlines until an unescaped newline is read
        var lastNewlineWasEscaped = false
        while (pos < end || lastNewlineWasEscaped) {
            // read next line
            val buffer = new Text
            pos += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)

            // append newly read data to previous data if necessary
            value = if(lastNewlineWasEscaped) new Text(value.toString + "\n" + buffer.toString) else buffer

            // detect if delimiter was escaped
            lastNewlineWasEscaped = buffer.charAt(buffer.getLength - 1) == '\\'

            // let Spark know that a key-value pair is ready!
            if(!lastNewlineWasEscaped)
                return true
        }

        // end of split reached?
        return false
    }
}
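The joining logic of nextKeyValue can be tried out without Hadoop. The following is a minimal in-memory sketch of the same idea, not the article's code: EscapedLines is a hypothetical name, and it works on an Iterator[String] of physical lines instead of a LineReader.

```scala
object EscapedLines {
  // Joins physical lines into logical records: a line ending in '\' continues on the next line.
  // As in the RecordReader above, the backslash is kept and the parts are joined with "\n".
  def join(lines: Iterator[String]): Iterator[String] = new Iterator[String] {
    def hasNext: Boolean = lines.hasNext
    def next(): String = {
      val record = new StringBuilder(lines.next())
      while (record.nonEmpty && record.last == '\\' && lines.hasNext) {
        record.append('\n').append(lines.next())
      }
      record.toString
    }
  }
}
```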

Note: You may also need to implement getCurrentKey, getCurrentValue, close and getProgress in your RecordReader.
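Of these four, getCurrentKey and getCurrentValue would presumably just return the key and value fields, and close would close the reader. For getProgress, one common approach is to report the fraction of the split that has been consumed. A hedged sketch of that calculation, with SplitProgress as a hypothetical helper that takes the reader's start, end and pos values as plain arguments:

```scala
object SplitProgress {
  // Fraction of the split consumed so far, clamped to [0, 1].
  // An empty split (start == end) reports 0 to avoid dividing by zero.
  def progress(start: Long, end: Long, pos: Long): Float =
    if (end == start) 0.0f
    else math.max(0.0f, math.min(1.0f, (pos - start).toFloat / (end - start)))
}
```

Inside the RecordReader, getProgress could then return SplitProgress.progress(start, end, pos).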

High memory consumption in a map task in Spark
