Question

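I'm new to Spark and I'm stuck on this problem:

From a DataFrame I created, named reportesBN, I want to get the value of a field and use it to read a TextFile from a specific path. Then I want to apply a specific process to that file.

I have written this code, but it does not work: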

reportesBN.foreach { 
      x => 
        val file = x(0)
        val insumo = sc.textFile(s"$file")

        val firstRow = insumo.first.split("\\|", -1)

        // Get values of next rows
        val nextRows = insumo.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }

        val dfNextRows = nextRows.map(a => a.split("\\|")).map(x=> BalanzaNextRows(x(0), x(1),
          x(2), x(3), x(4))).toDF() 

        val validacionBalanza = new RevisionCampos(sc)
        validacionBalanza.validacionBalanza(firstRow, dfNextRows)
}

The error log indicates that this is caused by serialization.

7/06/28 18:55:45 INFO SparkContext: Created broadcast 0 from textFile at ValidacionInsumos.scala:56
Exception in thread "main" org.apache.spark.SparkException: Task not serializable

Is this problem caused by the Spark context (sc) inside the foreach?

Is there another way to accomplish this?

Regards.

Answer 1

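This is the same issue as in a very similar question you asked before: you cannot use SparkContext inside an RDD transformation or action, because those closures are serialized and run on the executors, while SparkContext exists only on the driver and is not serializable. In this case, you call sc.textFile(s"$file") inside reportesBN.foreach, and reportesBN is, as you said, a DataFrame:

From a DataFrame I created, named reportesBN

You should rewrite the transformation so that you first take the file names out of the DataFrame and only read the files afterwards.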

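// This replaces `val file = x(0)`.
// I assume that the column name is `files`.
// `.as[String]` needs the session's implicits in scope, e.g. `import spark.implicits._`.
// `collect` returns a Scala Array[String] on the driver; `collectAsList` would return
// a java.util.List, which has no Scala `foreach` without a collection converter.
val files = reportesBN.select("files").as[String].collect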

Once you have the collection of files to process on the driver, execute the code from your block for each of them.

files.foreach { 
      x => ...
}
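
To make the pattern above concrete, here is a minimal sketch of the whole flow. Several things in it are assumptions for illustration only: the object name `ReadFilesFromColumn`, the helper `processFile`, the local master, the temporary sample files, and the column name `files` (which this answer already assumed). The `RevisionCampos` validation from the question is left out because its code is not shown.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical names throughout; a sketch, not the asker's actual job.
object ReadFilesFromColumn {

  // Runs on the driver, so using `sc` here is safe.
  // Returns the split header row and the number of remaining rows.
  def processFile(sc: SparkContext, file: String): (Array[String], Long) = {
    val insumo = sc.textFile(file)
    val firstRow = insumo.first().split("\\|", -1)
    // Skip the header line, which always sits in partition 0.
    val nextRows = insumo.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter
    }
    (firstRow, nextRows.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-files-from-column")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    // Assumption: two small pipe-delimited sample files stand in for the real inputs.
    val paths = Seq("a", "b").map { name =>
      val p = Files.createTempFile(s"balanza_$name", ".txt")
      Files.write(p, "h1|h2|h3\n1|2|3\n4|5|6\n".getBytes(StandardCharsets.UTF_8))
      p.toUri.toString
    }

    // Stand-in for reportesBN with a single column named `files`.
    val reportesBN = paths.toDF("files")

    // Bring the file names to the driver first ...
    val files: Array[String] = reportesBN.select("files").as[String].collect()

    // ... then loop over a plain Scala collection, not over the DataFrame.
    files.foreach { file =>
      val (header, rowCount) = processFile(spark.sparkContext, file)
      println(s"$file: ${header.length} header fields, $rowCount data rows")
    }

    spark.stop()
  }
}
```

Because `files.foreach` iterates over a local array, nothing that references `sc` is shipped to the executors, so the "Task not serializable" error should not occur. Note that `collect` pulls every file name into driver memory, which is fine for a modest list of paths but worth reconsidering if the DataFrame is very large.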
