Loading a .csv file from HDFS in Scala

Date: 2017-12-13 13:14:04

Tags: scala hadoop hdfs

So I basically have the following code to read a .csv file and store it in an Array[Array[String]]:

def load(filepath: String): Array[Array[String]] = {
      var data = Array[Array[String]]()
      val bufferedSource = io.Source.fromFile(filepath)
      for (line <- bufferedSource.getLines) {
        data = data :+ line.split(",").map(_.trim)
      }
      bufferedSource.close
      return data.drop(1) // skip header
  }

This works fine for files that are not stored on HDFS. However, when I try the same thing on HDFS, I get

No such file or directory

When writing to a file on HDFS, I already had to change my original code by adding some FileSystem, Path, and PrintWriter parameters, but this time I don't know how to do that at all.

What I have so far:

  def load(filepath: String, sc: SparkContext): Array[Array[String]] = {
      var data = Array[Array[String]]()
      val fs = FileSystem.get(sc.hadoopConfiguration)
      val stream = fs.open(new Path(filepath))
      var line = ""
      while ((line = stream.readLine()) != null) {
        data :+ line.split(",").map(_.trim)
      }

      return data.slice(1,data.length-1) //skip header
  }

This should work, but instead I get a NullPointerException when comparing line to null (or when checking whether its length is greater than 0).
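As a side note, a likely culprit is the Java-style loop condition: in Scala an assignment is an expression of type Unit, not the assigned value, so `(line = stream.readLine()) != null` compares Unit against null and is always true, even after `line` itself has become null. A minimal, self-contained sketch of the pitfall (no HDFS involved):

```scala
object AssignmentPitfall extends App {
  var line: String = ""

  // In Scala, an assignment is an expression of type Unit:
  val result: Unit = (line = null)

  // Unit is never null, so the Java-style condition is always true...
  assert((result: Any) != null) // this is what the while loop actually tests

  // ...even though line itself is now null, which the loop body dereferences -> NPE
  assert(line == null)
}
```

That is why the second snippet loops past the end of the stream and then throws.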

2 Answers:

Answer 0 (score: 1)

This code will read a .csv file from HDFS:

  import scala.collection.mutable.ArrayBuffer
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.SparkContext

  def read(filepath: String, sc: SparkContext): ArrayBuffer[Array[String]] = {
      var data = ArrayBuffer[Array[String]]()
      val fs = FileSystem.get(sc.hadoopConfiguration)
      val stream = fs.open(new Path(filepath))
      var line = stream.readLine()
      while (line != null) {
        val row = line.split(",").map(_.trim)
        data += row
        line = stream.readLine()
      }
      stream.close()

      return data // or return data.drop(1) to skip the header
  }

Answer 1 (score: -1)

Please read this post about reading CSV by Alvin Alexander, author of the Scala Cookbook:

object CSVDemo extends App {
  println("Month, Income, Expenses, Profit")
  val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
  for (line <- bufferedSource.getLines) {
    val cols = line.split(",").map(_.trim)
    // do whatever you want with the columns here
    println(s"${cols(0)}|${cols(1)}|${cols(2)}|${cols(3)}")
  }
  bufferedSource.close
}

You just have to get an InputStream from your HDFS and substitute it into this snippet.
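One way to sketch that substitution: factor the parsing out over a plain java.io.InputStream, which scala.io.Source can consume directly. The HDFS hookup is shown commented out, since it assumes a FileSystem/Path setup like the one in the question; here the parsing is exercised with an in-memory stream instead:

```scala
import java.io.{ByteArrayInputStream, InputStream}
import scala.io.Source

object HdfsCsvSketch extends App {
  // Same parsing as the snippet above, but reading from any InputStream
  def readCsv(in: InputStream): Array[Array[String]] =
    Source.fromInputStream(in).getLines.map(_.split(",").map(_.trim)).toArray

  // On a real cluster you would obtain the stream from HDFS instead, e.g.:
  //   val fs = FileSystem.get(sc.hadoopConfiguration)
  //   val in = fs.open(new Path(filepath)) // FSDataInputStream is an InputStream
  val in = new ByteArrayInputStream("Month, Income\nJan, 100".getBytes("UTF-8"))

  val rows = readCsv(in)
  println(rows.map(_.mkString("|")).mkString("\n")) // prints Month|Income then Jan|100
}
```

Because FSDataInputStream is a java.io.InputStream, the same readCsv works unchanged whether the bytes come from HDFS or from memory.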