Question

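I have code that streams text files from HDFS. However, each text file starts with 50 lines of header and description. I want to ignore those lines and ingest only the data.

Here is my code, but it throws SparkException: Task not serializable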

val hdfsDStream = ssc.textFileStream("hdfs://sandbox.hortonworks.com/user/root/log")

hdfsDStream.foreachRDD(
  rdd => {
    val data = rdd.mapPartitionsWithIndex((partitionIdx: Int, lines: Iterator[String])
    => {
      if (partitionIdx == 0) {
        lines.drop(50)
      }
      lines
    })

    val rowRDD = data.map(_.split(",")).map(p => Row(p(0),p(1),p(2),p(3)))

    if (data.count() > 0) {
        ...
    }
  }
)

Answer 1

The Task not serializable error shows up in situations like the ones described in: Passing Functions to Spark: What is the risk of referencing the whole object? or Task not serializable exception while running apache spark job

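Most likely you are creating some kind of object there and calling one of its functions inside an RDD method, which forces the engine to serialize the whole object.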
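To make that concrete, here is a minimal Spark-free sketch. The names `LineParser`, `badSplitter`, `goodSplitter` and `ClosureCheck` are hypothetical and invented for this example; they are not taken from the question. It assumes the hidden part of the code does something similar, namely using a field or method of a non-serializable object inside a function passed to an RDD method. The check uses plain Java serialization, which is roughly what Spark applies to closures before shipping them to executors.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical helper class; note that it does NOT extend Serializable.
class LineParser(val separator: String) {
  // `separator` here means `this.separator`, so the closure captures `this`
  // and the whole LineParser would have to be serialized with it.
  def badSplitter: String => Array[String] = line => line.split(separator)

  // Copying the field into a local val first means the closure captures
  // only a String, which is serializable.
  def goodSplitter: String => Array[String] = {
    val sep = separator
    line => line.split(sep)
  }
}

object ClosureCheck {
  // Tries to Java-serialize the given function object and reports whether it worked.
  def isSerializable(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

With a `LineParser(",")`, I would expect `ClosureCheck.isSerializable` to report false for `badSplitter` and true for `goodSplitter`. The usual fixes follow the same pattern: copy what you need into a local val before the RDD call, make the class `Serializable`, or move the function into a standalone `object`.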

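Unfortunately, the part of the code you posted runs perfectly fine; the problem is in the part you replaced with dots. For example, this works: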

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.sql._

val ssc = new StreamingContext(sc, Seconds(60))
val hdfsDStream = ssc.textFileStream("/sparkdemo/streaming")

hdfsDStream.foreachRDD(
  rdd => {
    val data = rdd.mapPartitionsWithIndex((partitionIdx: Int, lines: Iterator[String])
    => {
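      // Iterator.drop returns the iterator to continue with;
      // the original iterator should not be reused after calling it
      if (partitionIdx == 0) lines.drop(50) else lines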
    })

    val rowRDD = data.map(_.split(",")).map(p => Row(p(0),p(1),p(2),p(3)))

    if (data.count() > 0) {
        rowRDD.take(10).foreach(println)
    }
  }
)
ssc.start()
ssc.awaitTermination()
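One detail of the partition-based approach is easy to check without Spark: `Iterator.drop` returns the iterator you should continue with, and the Scala documentation says the original iterator should not be reused after calling it. A minimal sketch, where `HeaderSkip` and `skipHeader` are hypothetical names made up for this example:

```scala
object HeaderSkip {
  // Drops the first `headerLines` lines, but only for partition 0;
  // every other partition is passed through unchanged.
  def skipHeader(partitionIdx: Int,
                 lines: Iterator[String],
                 headerLines: Int): Iterator[String] =
    if (partitionIdx == 0) lines.drop(headerLines) else lines
}
```

Inside the streaming job this could be used as `rdd.mapPartitionsWithIndex((idx, it) => HeaderSkip.skipHeader(idx, it, 50))`. Because the helper lives in a standalone `object`, it does not pull any enclosing instance into the closure. As far as I can tell, this only strips the header that lands in partition 0, so if one batch picks up several files, the headers of the other files would still be in the data.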

Answer 2

I think you just need zipWithIndex and then filter out the lines whose index is less than 50.

val hdfsDStream = ssc.textFileStream("hdfs://sandbox.hortonworks.com/user/root/log")

hdfsDStream.foreachRDD( rdd => {
  // Keep only lines with index >= 50, i.e. drop the first 50 lines of the RDD
  val data = rdd.zipWithIndex.filter( _._2 >= 50 ).map( _._1 )

  // Now do whatever you want with your data.
} )

Also, regarding Row(p(0),p(1),p(2),p(3)): do you really need Row at that point?

Spark: foreachRDD, skipping lines throws Task not serializable (Scala closure)

2 answers: