I am reading lines from a single HDFS file with the code below, borrowing the using method described here.
Given import org.apache.hadoop.fs._, the code below uses fs: FileSystem and path: Path. These functions are actually methods wrapped in a class that defines the FileSystem fs.
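For context, here is a minimal sketch of the setup the methods below assume; the class name HdfsReader and the way fs is constructed are my own illustration, not part of the original code:

import org.apache.commons.io.IOUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source
import scala.language.reflectiveCalls // for the structural type used by `using`

// Hypothetical wrapper: the methods below are members of a class
// that defines the FileSystem fs.
class HdfsReader(fs: FileSystem) {
  // `using`, readFileByLine and readWholeFile go here
}

object HdfsReader {
  def apply(): HdfsReader = new HdfsReader(FileSystem.get(new Configuration()))
}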
/**
 * Loan pattern: run an action on a resource and always close it afterwards.
 *
 * @param param  a close-able resource
 * @param action data manipulation performed with the resource
 * @return the action's return value is bubbled up
 */
private def using[A <: { def close(): Unit }, B](param: A)(action: A => B): B =
  try {
    action(param)
  } finally {
    param.close()
  }
/**
 * Open - read - close the file while returning its lines
 *
 * @param path where the file is stored
 * @return array of lines in the file
 */
def readFileByLine(path: Path): Array[String] =
  using(fs.open(path)) { fileInputStream =>
    using(Source.fromInputStream(fileInputStream)) { bufferedSource =>
      (for (line <- bufferedSource.getLines()) yield line).toArray
    }
  }
/**
 * Open - read - close the file, reading it whole and then splitting it into lines
 *
 * @param path where the file is stored
 * @return array of lines in the file
 */
def readWholeFile(path: Path): Array[String] =
  using(fs.open(path)) { inputStream =>
    IOUtils.toString(inputStream, "UTF-8").split("\n")
  }
These two methods appear to do two different things to reach the same goal: read the lines of an HDFS file and return them as an array of strings.
Given that the files are small, which reading approach would be considered standard Scala, and what are the trade-offs between the two?
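As a usage sketch (the reader instance and the path are illustrative, reusing the hypothetical HdfsReader above):

val reader = HdfsReader()                       // hypothetical factory from the sketch above
val path = new Path("/tmp/example.txt")         // hypothetical small file
val linesA: Array[String] = reader.readFileByLine(path)
val linesB: Array[String] = reader.readWholeFile(path)
// for a small '\n'-delimited file both calls should yield the same lines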
Added
I think all of this boils down to the following methods:
def readWholeFile(fs: FileSystem, path: Path): Array[String] = {
  var inputStream: FSDataInputStream = null
  try {
    inputStream = fs.open(path)
    IOUtils.toString(inputStream, "UTF-8").split("\n")
  } finally {
    inputStream.close()
  }
}
def readFileByLine(fs: FileSystem, path: Path): Array[String] = {
  var fileInputStream: FSDataInputStream = null
  var bufferedSource: scala.io.BufferedSource = null
  try {
    fileInputStream = fs.open(path)
    bufferedSource = Source.fromInputStream(fileInputStream)
    (for (line <- bufferedSource.getLines()) yield line).toArray
  } finally {
    bufferedSource.close()
    fileInputStream.close()
  }
}
Perhaps these are easier to read, and leaner in terms of stack usage, compile time, and run duration...
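On the "standard Scala" part of the question: if Scala 2.13+ is available, the standard library's scala.util.Using provides essentially the same loan pattern as the borrowed using helper; a rough sketch (my own rewrite, not from the original code):

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source
import scala.util.Using

def readFileByLine(fs: FileSystem, path: Path): Array[String] =
  Using.resource(fs.open(path)) { in =>                 // FSDataInputStream is AutoCloseable
    Using.resource(Source.fromInputStream(in)) { src => // Source is Closeable
      src.getLines().toArray
    }
  }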