I am trying to create a string sampler that uses an RDD of strings as a dictionary, together with the RandomDataGenerator class from the org.apache.spark.mllib.random package.
import org.apache.spark.mllib.random.RandomDataGenerator
import org.apache.spark.rdd.RDD
import scala.util.Random

class StringSampler(var dic: RDD[String], var seed: Long = System.nanoTime) extends RandomDataGenerator[String] {
    require(dic != null, "Dictionary cannot be null")
    require(!dic.isEmpty, "Dictionary must contain lines (words)")
    Random.setSeed(seed)
    // The sampling fraction must be a Double; integer division would always yield 0
    var fraction: Double = 1.0 / dic.count()

    // Return a random line from the dictionary
    override def nextValue(): String = dic.sample(withReplacement = true, fraction).take(1)(0)

    override def setSeed(newSeed: Long): Unit = Random.setSeed(newSeed)

    override def copy(): StringSampler = new StringSampler(dic)

    def setDictionary(newDic: RDD[String]): Unit = {
        require(newDic != null, "Dictionary cannot be null")
        require(!newDic.isEmpty, "Dictionary must contain lines (words)")
        dic = newDic
        fraction = 1.0 / dic.count()
    }
}
But when I try to generate a random RDD of strings, I get a SparkException saying the dictionary is missing a SparkContext. It seems that Spark loses the dictionary RDD's context when the dictionary is copied to the cluster nodes, and I don't know how to solve that.
I tried caching the dictionary before passing it to the StringSampler, but it didn't change anything... I was thinking of linking it back to the original SparkContext, but I don't even know whether that is possible. Does anyone have an idea?
val dictionaryName: String
val dictionaries: Broadcast[Map[String, RDD[String]]]
val dictionary: RDD[String] = dictionaries.value(dictionaryName) // dictionary.cache()
val sampler = new StringSampler(dictionary)
RandomRDDs.randomRDD(context, sampler, size, numPartitions)
Answer 0 (score: 0)
I believe the problem is here:
val dictionaries: Broadcast[Map[String, RDD[String]]]
val dictionary: RDD[String] = dictionaries.value(dictionaryName)
You should not broadcast anything containing an RDD. An RDD is already parallelized and spread across the whole cluster. The error comes from trying to serialize and deserialize the RDD, which loses its context and is pointless anyway.
Do this instead:
val dictionaries: Map[String, RDD[String]]
val dictionary: RDD[String] = dictionaries(dictionaryName)
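Note that even without the broadcast, keeping an RDD reference inside the generator can still fail once RandomRDDs ships the sampler to the executors, because RDD operations cannot be nested inside tasks. A minimal sketch of an alternative, assuming the dictionary is small enough to collect into driver memory (the LocalStringSampler name is hypothetical):

import org.apache.spark.mllib.random.{RandomDataGenerator, RandomRDDs}
import scala.util.Random

// Hypothetical variant holding a plain Array[String] instead of an RDD,
// so it serializes cleanly to executors without dragging a SparkContext along.
class LocalStringSampler(words: Array[String], seed: Long = System.nanoTime)
    extends RandomDataGenerator[String] {
    require(words != null && words.nonEmpty, "Dictionary must contain words")

    private val rng = new Random(seed)

    // Pick a uniformly random word from the local array
    override def nextValue(): String = words(rng.nextInt(words.length))

    override def setSeed(newSeed: Long): Unit = rng.setSeed(newSeed)

    override def copy(): LocalStringSampler = new LocalStringSampler(words, rng.nextLong())
}

// Usage: collect the dictionary once on the driver, then generate the random RDD.
// val words = dictionaries(dictionaryName).collect()
// val sampler = new LocalStringSampler(words)
// RandomRDDs.randomRDD(context, sampler, size, numPartitions)

Collecting once on the driver keeps the sampler fully serializable, at the cost of requiring the dictionary to fit in memory.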