I have many files containing about 60,000,000 lines. All of my files use the format {timestamp}#{producer}#{messageId}#{data_bytes}\n
I walk through my files one by one and also want to build one output file per input file. Because some lines depend on previous lines, I group them by their producer. Whenever a line depends on one or more previous lines, their producer is always the same. After grouping all of the lines, I hand them over to my Java parser. The parser then holds all parsed data objects in memory and afterwards outputs them as JSON.
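For illustration only (the values below are made up, not taken from the question), a line in this format can be keyed by its producer by splitting on the '#' delimiter:

// made-up example line, just to illustrate the format
val sampleLine = "1467113261123#producerA#42#some payload bytes"
val producer = sampleLine.split("#")(1) // "producerA", the grouping key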
To visualize how I process my job, I threw together the following "flowchart" (the image is not reproduced here). Note that I did not visualize the groupByKey / shuffling process.
My problem:
Container killed by YARN for exceeding memory limits. 7.6 GB of 7.5 GB physical memory used
My questions:
I assume that my map transformation val lineMap = lines.map ... (see the Scala code below) produces a partitioned RDD. I therefore expect the values of the RDD to be split up in some way before the second map task is invoked.
Furthermore, I assume that calling saveAsTextFile on this RDD lineMap produces one output task that runs after each of my map tasks has finished. If my assumptions are correct, why does my executor still run out of memory? Does Spark make several (too) big file splits and process them concurrently, causing the parser to fill up memory?
Would repartitioning the lineMap RDD to get more (and smaller) inputs for my parser be a good idea?
Scala code (I left out the irrelevant parts):
def main(args: Array[String]) {
  val inputFilePath = args(0)
  val outputFilePath = args(1)

  val inputFiles = fs.listStatus(new Path(inputFilePath))
  inputFiles.foreach( filename => {
    processData(filename.getPath, ...)
  })
}
def processData(filePath: Path, ...) {
  val lines = sc.textFile(filePath.toString())
  // key each line by its producer (fields are '#'-delimited:
  // {timestamp}#{producer}#{messageId}#{data_bytes})
  val lineMap = lines.map(line => (line.split("#")(1), line)).groupByKey()

  val parsedLines = lineMap.map{ case(key, values) => parseLinesByKey(key, values, config) }
  //each output should be saved separately
  parsedLines.saveAsTextFile(outputFilePath.toString() + "/" + filePath.getName)
}
def parseLinesByKey(key: String, values: Iterable[String], config : Config) = {
  // requires `import scala.collection.JavaConverters._` for .asJava
  val importer = new LogFileImporter(...)
  importer.parseData(values.toIterator.asJava, ...)

  // importer now holds in memory all data objects that could be parsed
  // from the given values.

  val jsonMapper = getJsonMapper(...)
  val jsonStringData = jsonMapper.getValueFromString(importer.getDataObject)

  (key, jsonStringData)
}
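For reference, the repartitioning idea asked about above would look roughly like the following. This is an illustrative sketch only; the partition count is made up, and nothing here is confirmed to fix the memory issue:

// illustrative only: ask Spark for more, smaller partitions before parsing
// (200 is a made-up number; tune it to your cluster)
val repartitionedLineMap = lineMap.repartition(200)
val parsedLines = repartitionedLineMap.map { case (key, values) => parseLinesByKey(key, values, config) }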
Answer 0 (score: 0)
I fixed this by removing the groupByKey call and implementing a new FileInputFormat together with a RecordReader, which removes my limitation that lines depend on other lines. For now, I implemented it so that each split contains a 50,000-byte overhead from the previous split. This ensures that all lines that depend on previous lines can be parsed correctly.
I will now go on to look only at the last 50,000 bytes of the previous split, but copy over only the lines that actually affect the parsing of the current split. This way I minimize the overhead and still get a highly parallelizable task.
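To make the overhead idea concrete, here is a minimal sketch of the offset arithmetic only (all names and numbers are made up for illustration; this is not the author's actual RecordReader code):

// hypothetical offsets for one file split
val overheadBytes = 50000L                 // fixed look-back into the previous split
val splitStart    = 134217728L             // made-up: where this split officially begins
val splitLength   = 134217728L             // made-up: length of this split

// start reading overheadBytes earlier so the first records of this split can
// still see the lines they depend on, but stop emitting records at the split end
val readStart = 0L.max(splitStart - overheadBytes)
val readEnd   = splitStart + splitLength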
The following links pushed me in the right direction. Since the topic of FileInputFormat / RecordReader is quite daunting at first sight (at least it was for me), it is worth reading through these articles and figuring out whether the approach fits your problem:
Relevant code part from the ae.be article, in case the site ever goes down. The author (@Gurdt) uses it to detect whether a chat message contains an escaped line return (a line ending with "\") and appends the escaped lines together until an unescaped \n is found. This lets him retrieve messages that span two or more lines. The code is written in Scala:
Usage
// imports needed by the snippets below
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}
import org.apache.hadoop.util.LineReader

val conf = new Configuration(sparkContext.hadoopConfiguration)
val rdd = sparkContext.newAPIHadoopFile("data.txt", classOf[MyFileInputFormat],
  classOf[LongWritable], classOf[Text], conf)
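The resulting rdd contains (byte offset, record) pairs. As a small follow-up (not part of the original article), the reassembled records could be pulled out like this:

// keep only the record text; each Text value is one logical record,
// with escaped newlines already merged by the RecordReader
val records = rdd.map { case (_, text) => text.toString }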
FileInputFormat
class MyFileInputFormat extends FileInputFormat[LongWritable, Text] {
  override def createRecordReader(split: InputSplit, context: TaskAttemptContext):
      RecordReader[LongWritable, Text] = new MyRecordReader()
}
RecordReader
class MyRecordReader() extends RecordReader[LongWritable, Text] {
  var start, end, pos = 0L
  var reader: LineReader = null
  var key = new LongWritable
  var value = new Text

  override def initialize(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
    // split position in data (start one byte earlier to detect if
    // the split starts in the middle of a previous record)
    val split = inputSplit.asInstanceOf[FileSplit]
    start = 0L.max(split.getStart - 1)
    end = start + split.getLength

    // open a stream to the data, pointing to the start of the split
    val stream = split.getPath.getFileSystem(context.getConfiguration)
      .open(split.getPath)
    stream.seek(start)
    reader = new LineReader(stream, context.getConfiguration)

    // if the split starts at a newline, we want to start yet another byte
    // earlier to check if the newline was escaped or not
    val firstByte = stream.readByte().toInt
    if(firstByte == '\n')
      start = 0L.max(start - 1)
    stream.seek(start)

    if(start != 0)
      skipRemainderFromPreviousSplit(reader)
  }

  def skipRemainderFromPreviousSplit(reader: LineReader): Unit = {
    var readAnotherLine = true
    while(readAnotherLine) {
      // read next line
      val buffer = new Text()
      start += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)
      pos = start

      // detect if delimiter was escaped
      readAnotherLine = buffer.getLength >= 1 && // something was read
        buffer.charAt(buffer.getLength - 1) == '\\' && // newline was escaped
        pos <= end // seek head hasn't passed the split
    }
  }

  override def nextKeyValue(): Boolean = {
    key.set(pos)

    // read newlines until an unescaped newline is read
    var lastNewlineWasEscaped = false
    while (pos < end || lastNewlineWasEscaped) {
      // read next line
      val buffer = new Text
      pos += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)

      // append newly read data to previous data if necessary
      value = if(lastNewlineWasEscaped) new Text(value + "\n" + buffer) else buffer

      // detect if delimiter was escaped
      lastNewlineWasEscaped = buffer.charAt(buffer.getLength - 1) == '\\'

      // let Spark know that a key-value pair is ready!
      if(!lastNewlineWasEscaped)
        return true
    }

    // end of split reached?
    return false
  }
}
Note: you might also need to implement getCurrentKey, getCurrentValue, close and getProgress in your RecordReader as well.
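For completeness, a minimal sketch of what those remaining methods could look like for this reader (my own addition, not part of the ae.be article; it reuses the key, value, pos, start, end and reader fields defined in the class above):

override def getCurrentKey(): LongWritable = key

override def getCurrentValue(): Text = value

// rough progress estimate based on how far the read position has advanced within the split
override def getProgress(): Float =
  if (end == start) 1.0f
  else ((pos - start).toFloat / (end - start)).min(1.0f)

override def close(): Unit = {
  if (reader != null)
    reader.close()
}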