Question

for (fordate <- 2 to 30) {
  val dataRDD = sc.textFile("s3n://mypath" + fordate + "/*")
  val a = 1
  val c = fordate - 1
  for (b <- a to c) {
    val cumilativeRDD1 = sc.textFile("s3n://mypath/" + b + "/*")
    val cumilativeRDD : org.apache.spark.rdd.RDD[String]  = sc.union(cumilativeRDD1, cumilativeRDD)
    if (b == c) {
      val incrementalDEviceIDs = dataRDD.subtract(cumilativeRDD)
      val countofIDs = incrementalDEviceIDs.distinct().count()
      println(s"201611 $fordate  $countofIDs")
    }
  }
}

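I have a dataset in which I receive device IDs every day. I need to work out the incremental count for each day, but when I union cumilativeRDD with itself, I get the following error:

forward reference extends over definition of value cumilativeRDD

How can I overcome this?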

Answer 1

The problem is in this line:

val cumilativeRDD : org.apache.spark.rdd.RDD[String]  = sc.union(cumilativeRDD1 ,cumilativeRDD)

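You are using cumilativeRDD before it has been declared. An assignment is evaluated from right to left: the right-hand side of = defines the variable on the left. You therefore cannot use a variable inside its own definition, because when the right-hand side is evaluated the variable does not exist yet.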
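
To illustrate the point with something smaller than an RDD, here is a minimal sketch using plain strings. The object and method names (ForwardReferenceDemo, accumulate) are made up for this example; the commented-out line shows the kind of self-referencing val that is rejected inside a block, and the var shows the usual way around it.

```scala
object ForwardReferenceDemo {
  // Hypothetical helper: concatenates the parts using a running accumulator.
  def accumulate(parts: Seq[String]): String = {
    // Inside a block, a val cannot refer to itself on its right-hand side:
    //   val acc: String = acc + parts.head   // rejected: forward reference
    var acc = ""        // declared and initialised once, before the loop
    for (p <- parts) {
      acc = acc + p     // a reassignment may read the previous value
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    println(accumulate(Seq("a", "b", "c"))) // prints abc
  }
}
```

The difference is that the var already exists when the loop body runs, so each reassignment reads a value that was stored earlier rather than the value being defined.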

You have to initialise cumilativeRDD on the first pass; after that you can use it on the following passes:

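for (fordate <- 2 to 30) {
  val dataRDD = sc.textFile("s3n://mypath" + fordate + "/*")
  val c = fordate - 1
  // Start empty for every date, so earlier days are not unioned in a second time.
  var cumilativeRDD: Option[org.apache.spark.rdd.RDD[String]] = None
  for (b <- 1 to c) {
    val cumilativeRDD1 = sc.textFile("s3n://mypath/" + b + "/*")
    // First pass: nothing accumulated yet, so start from this day's RDD.
    // Later passes: union this day's RDD with what has been accumulated so far.
    cumilativeRDD = cumilativeRDD match {
      case None      => Some(cumilativeRDD1)
      case Some(acc) => Some(sc.union(cumilativeRDD1, acc))
    }

    if (b == c) {
      // All previous days are accumulated; keep only the IDs that are new today.
      val incrementalDEviceIDs = dataRDD.subtract(cumilativeRDD.get)
      val countofIDs = incrementalDEviceIDs.distinct().count()
      println("201611" + fordate + "  " + countofIDs)
    }
  }
}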
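
As a side note, the same accumulate-then-subtract idea can be written so that each day is visited only once, by carrying the cumulative set forward from one date to the next. The sketch below is only an illustration on plain Scala collections, not Spark code: the in-memory Map stands in for the daily S3 files, and the names IncrementalCounts, newIdsPerDay and seenSoFar are hypothetical. Here `--` and `++` on Set play the roles that subtract (followed by distinct) and union play for RDDs.

```scala
object IncrementalCounts {
  // Hypothetical stand-in for the daily files: day number -> device IDs seen that day.
  // Returns (day, number of IDs not seen on any earlier day) for every day after firstDay.
  def newIdsPerDay(daily: Map[Int, Seq[String]], firstDay: Int, lastDay: Int): List[(Int, Int)] = {
    var seenSoFar = Set.empty[String]   // every ID seen strictly before the current day
    var counts = List.empty[(Int, Int)] // built in reverse, reversed at the end
    for (day <- firstDay to lastDay) {
      val today = daily.getOrElse(day, Seq.empty).toSet // distinct IDs of the day
      if (day > firstDay) {
        val newIds = today -- seenSoFar                 // IDs that appear for the first time
        counts = (day, newIds.size) :: counts
      }
      seenSoFar = seenSoFar ++ today                    // carry the cumulative set forward
    }
    counts.reverse
  }

  def main(args: Array[String]): Unit = {
    val daily = Map(
      1 -> Seq("a", "b"),
      2 -> Seq("b", "c"),
      3 -> Seq("a", "c", "d", "d")
    )
    println(newIdsPerDay(daily, 1, 3)) // prints List((2,1), (3,1))
  }
}
```

With RDDs the same shape would presumably mean keeping one cumulative RDD outside the date loop and unioning each day's data into it after the count, instead of re-reading days 1 to c for every date; whether that is worthwhile depends on data size and caching.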

Scala - Appending an RDD to itself
