Question

I need to read several CSV files and convert some of their columns from String to Double.

The code is as follows:

  def f(s:String):Double = s.toDouble

  def readonefile(path:String) = {
    val data = for {
      line <-  sc.textFile( path )
      arr = line.split(",").map(_.trim)
      id = arr(33)
    } yield {
        val countings = ((9 to 14) map arr).toVector map f
        id -> countings.toVector
      }
    data
  }

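If I write the toDouble conversion explicitly as a function (for example, the function f in the code above), it throws java.io.IOException or java.lang.ExceptionInInitializerError.

However, if I change countings to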

val countings = ((9 to 14) map arr).toVector map (_.toDouble)

then everything works.

Is the function f serializable?

Edit:

Some people say this is the same as "Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects", but then why is no Task not serializable exception thrown?

Scala version: 2.10

Spark version: 1.3.1

Environment: yarn-client

Answer 1

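We can move the function f into a companion object. I also rewrote the transformation to avoid the for loop, though I'm not sure it does exactly what you want. Note that you may want to use spark-csv rather than just splitting on commas, but hopefully this illustrates the idea: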

  object Panda {
    def f(s:String):Double = s.toDouble
  }

  def readonefile(path:String) = {
      val input = sc.textFile( path )
      val arrs = input.map(line => line.split(",").map(_.trim))
      arrs.map(arr => (arr(33).toDouble,
                        ((9 to 14) map arr).map(Panda.f).toVector))
  }
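
The same idea, keeping the conversion logic in a standalone object so that the function passed to map does not depend on an enclosing instance, can be sketched without Spark. This is only a minimal sketch: the object name CsvColumns and its methods are hypothetical, the column positions (33 for the id, 9 to 14 for the counts) are taken from the question, and returning None for malformed lines is an assumption about the desired behavior.

```scala
import scala.util.Try

// Hypothetical helper object (name and layout are assumptions, not from the question).
// Methods on a top-level object are reached statically, so a closure that calls them
// does not need a reference to an enclosing instance.
object CsvColumns {
  // Column positions taken from the question: id in column 33, counts in columns 9 to 14.
  val IdColumn = 33
  val CountColumns = 9 to 14

  // Returns None instead of throwing when the text is not a valid number.
  def toDoubleOption(s: String): Option[Double] = Try(s.trim.toDouble).toOption

  // Parses one CSV line into (id, counts). Returns None if the line is too short
  // or any of the count columns is not numeric.
  def parseLine(line: String): Option[(String, Vector[Double])] = {
    // The -1 limit keeps trailing empty fields so column positions stay stable.
    val arr = line.split(",", -1).map(_.trim)
    if (arr.length <= IdColumn) None
    else {
      val counts = CountColumns.toVector.map(i => toDoubleOption(arr(i)))
      if (counts.forall(_.isDefined)) Some(arr(IdColumn) -> counts.flatten) else None
    }
  }
}
```

With Spark, something like sc.textFile(path).flatMap(line => CsvColumns.parseLine(line)) should then silently drop malformed lines; whether dropping them is acceptable depends on the data.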
