I am trying to create a string sampler that uses an RDD of strings as a dictionary, together with the RandomDataGenerator class from the org.apache.spark.mllib.random package.
import org.apache.spark.mllib.random.RandomDataGenerator
import org.apache.spark.rdd.RDD
import scala.util.Random

class StringSampler(var dic: RDD[String], var seed: Long = System.nanoTime) extends RandomDataGenerator[String] {
    require(dic != null, "Dictionary cannot be null")
    require(!dic.isEmpty, "Dictionary must contain lines (words)")
    Random.setSeed(seed)
    // The sampling fraction must be a Double; integer division would always yield 0
    var fraction: Double = 1.0 / dic.count()

    // Return a random line from the dictionary
    override def nextValue(): String = dic.sample(withReplacement = true, fraction).take(1)(0)

    override def setSeed(newSeed: Long): Unit = Random.setSeed(newSeed)

    override def copy(): StringSampler = new StringSampler(dic)

    def setDictionary(newDic: RDD[String]): Unit = {
        require(newDic != null, "Dictionary cannot be null")
        require(!newDic.isEmpty, "Dictionary must contain lines (words)")
        dic = newDic
        fraction = 1.0 / dic.count()
    }
}
But when I try to generate a random RDD of strings, I get a SparkException saying the dictionary is missing a SparkContext. It seems that Spark loses the dictionary RDD's context when the dictionary is copied to the cluster nodes, and I don't know how to solve that.
I tried caching the dictionary before passing it to the StringSampler, but it didn't change anything... I was thinking of linking it back to the original SparkContext, but I don't even know whether that is possible. Does anyone have an idea?
val dictionaryName: String
val dictionaries: Broadcast[Map[String, RDD[String]]]
val dictionary: RDD[String] = dictionaries.value(dictionaryName) // dictionary.cache()
val sampler = new StringSampler(dictionary)
RandomRDDs.randomRDD(context, sampler, size, numPartitions)
Answer 0 (score: 0)
I believe the problem is here:
val dictionaries: Broadcast[Map[String, RDD[String]]]
val dictionary: RDD[String] = dictionaries.value(dictionaryName)
You should not broadcast anything containing an RDD. An RDD is already parallelized and spread across the whole cluster. The error comes from trying to serialize and deserialize the RDD, which loses its context and is pointless anyway.
Do this instead:
val dictionaries: Map[String, RDD[String]]
val dictionary: RDD[String] = dictionaries(dictionaryName)
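Note that even without the broadcast, keeping an RDD reference inside the generator can still fail once RandomRDDs ships the sampler to the executors, because RDD operations cannot be nested inside tasks. A minimal sketch of an alternative, assuming the dictionary is small enough to collect into driver memory (the LocalStringSampler name is hypothetical):

import org.apache.spark.mllib.random.{RandomDataGenerator, RandomRDDs}
import scala.util.Random

// Hypothetical variant holding a plain Array[String] instead of an RDD,
// so it serializes cleanly to executors without dragging a SparkContext along.
class LocalStringSampler(words: Array[String], seed: Long = System.nanoTime)
    extends RandomDataGenerator[String] {
    require(words != null && words.nonEmpty, "Dictionary must contain words")

    private val rng = new Random(seed)

    // Pick a uniformly random word from the local array
    override def nextValue(): String = words(rng.nextInt(words.length))

    override def setSeed(newSeed: Long): Unit = rng.setSeed(newSeed)

    override def copy(): LocalStringSampler = new LocalStringSampler(words, rng.nextLong())
}

// Usage: collect the dictionary once on the driver, then generate the random RDD.
// val words = dictionaries(dictionaryName).collect()
// val sampler = new LocalStringSampler(words)
// RandomRDDs.randomRDD(context, sampler, size, numPartitions)

Collecting once on the driver keeps the sampler fully serializable, at the cost of requiring the dictionary to fit in memory.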