An easier way to read lines from an HDFS file

Date: 2018-07-12 00:34:36

Tags: scala hadoop

I am reading lines from a single HDFS file with the following code, borrowing the using method described here:

Given import org.apache.hadoop.fs._, the code below relies on fs: FileSystem and path: Path. These functions are actually methods wrapped in a class (the class that defines the FileSystem fs).
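
For context, here is a minimal sketch of such an enclosing class (the class name, the Configuration setup, and the commons-io import are my assumptions; the post only states that fs is defined by the class):

  import org.apache.commons.io.IOUtils            // used by readWholeFile below
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, FSDataInputStream, Path}
  import scala.io.Source
  import scala.language.reflectiveCalls           // silences the structural-type warning in using

  // Hypothetical wrapper class; the original post only says the methods
  // below live in a class that defines the FileSystem fs.
  class HdfsLineReader(conf: Configuration = new Configuration()) {
    private val fs: FileSystem = FileSystem.get(conf)

    // using, readFileByLine, and readWholeFile from below go here
  }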

  /**
    * Loan pattern: run an action against a close-able resource, then
    * close the resource even if the action throws.
    *
    * @param param  a close-able resource
    * @param action data manipulation to run against the resource
    * @return action's return value is bubbled up
    */
  private def using[A <: { def close(): Unit }, B](param: A)(action: A => B): B =
    try {
      action(param)
    } finally {
      param.close()
    }
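
As an aside, from Scala 2.13 onward the standard library ships the same loan pattern, so the hand-rolled helper can be swapped out. A minimal sketch, assuming Scala 2.13 (scala.util.Using.resource closes the resource in a finally block, much like using above; it works here because FSDataInputStream and Source are both close-able):

  import scala.util.Using

  // Standard-library equivalent of readFileByLine below (Scala 2.13+).
  def readFileByLineStd(path: Path): Array[String] =
    Using.resource(fs.open(path)) { in =>
      Using.resource(Source.fromInputStream(in)) { src =>
        src.getLines().toArray
      }
    }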

  /**
    * Open - read - close the file while returning its lines.
    *
    * @param path where the file is stored
    * @return array of lines in the file
    */
  def readFileByLine(path: Path): Array[String] =
    using(fs.open(path)) { fileInputStream =>
      using(Source.fromInputStream(fileInputStream)) { bufferedSource =>
        (for (line <- bufferedSource.getLines()) yield line).toArray
      }
    }

  /**
    * Open - read - close the file while returning its lines.
    *
    * @param path where the file is stored
    * @return array of lines in the file
    */
  def readWholeFile(path: Path): Array[String] =
    using(fs.open(path)) { inputStream =>
      IOUtils.toString(inputStream, "UTF-8").split("\n")
    }
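
For reference, a hypothetical call site (the path is illustrative only):

  // Hypothetical usage: both calls should yield the same lines for small files.
  val examplePath = new Path("/tmp/example.txt") // illustrative path
  readFileByLine(examplePath).foreach(println)
  readWholeFile(examplePath).foreach(println)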

These two methods seem to take two different routes to the same goal: read the lines of an HDFS file and return them as an array of strings. (Strictly speaking they can differ: getLines() also handles \r\n line endings, while split("\n") leaves the \r in place.)

Given that the files are small, which of these read methods would be considered standard Scala? What are the trade-offs between the two approaches?

Added:

I think all of this boils down to the following methods:

  def readWholeFile(fs: FileSystem, path: Path): Array[String] = {
    var inputStream: FSDataInputStream = null
    try {
      inputStream = fs.open(path)
      IOUtils.toString(inputStream, "UTF-8").split("\n")
    } finally {
      // guard: if fs.open threw, inputStream is still null
      if (inputStream != null) inputStream.close()
    }
  }


  def readFileByLine(fs: FileSystem, path: Path): Array[String] = {
    var fileInputStream: FSDataInputStream = null
    var bufferedSource: scala.io.BufferedSource = null
    try {
      fileInputStream = fs.open(path)
      bufferedSource = Source.fromInputStream(fileInputStream)
      (for (line <- bufferedSource.getLines()) yield line).toArray
    } finally {
      // guard against NPEs when an earlier statement threw before assignment
      if (bufferedSource != null) bufferedSource.close()
      if (fileInputStream != null) fileInputStream.close()
    }
  }
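
Under the same Scala 2.13 assumption as above, scala.util.Using.Manager removes the var/null/finally bookkeeping entirely, closing both resources in reverse order of acquisition. A sketch, not from the original post:

  import scala.util.Using

  // Flattened variant: Using.Manager tracks both resources and closes
  // them (source first, then stream) even if the body throws.
  def readFileByLineManaged(fs: FileSystem, path: Path): Array[String] =
    Using.Manager { use =>
      val in  = use(fs.open(path))
      val src = use(Source.fromInputStream(in))
      src.getLines().toArray
    }.get // rethrow any failure, matching the original methods' behavior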

Perhaps these versions are easier to read, and leaner in terms of stack usage, compile time, and running time...

0 Answers:

No answers yet.