I have an RDD of strings (though it could really be anything) and I want to inner join it with an RDD of random normals. I know this can be solved with .zipWithIndex on both RDDs, but that doesn't seem like it will scale well. Is there a way to initialize a random RDD with data from another RDD, or another approach that would be faster? Here is what I've done with .zipWithIndex:
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.rdd.RDD

val numExamples = 10 // number of rows in the random RDD
val maNum = 7 // number of columns (vector size)
val commonStdDev = 0.1 // common standard deviation 1/10, makes variance = 0.01

// Generate standard normals, rescale to the desired standard deviation,
// and key each row by its index so it can be joined.
val normalVectorRDD = RandomRDDs.normalVectorRDD(sc, numRows = numExamples, numCols = maNum)
val rescaledNormals = normalVectorRDD.map{ myVec => myVec.toArray.map(x => x * commonStdDev) }
  .zipWithIndex
  .map{ case (row, index) => (index, row) }

// Key the other RDD by line index in the same way.
val otherRDD = sc.textFile(otherFilepath)
  .zipWithIndex
  .map{ case (line, index) => (index, line) }

// Join on the shared index, then drop the key.
val joinedRDD = otherRDD.join(rescaledNormals).map{ case (index, (other, dArray)) => (other, dArray) }
Answer (score: 1)
In general I wouldn't worry about zipWithIndex. While it requires an additional action, it is a relatively cheap operation. The join, however, is another story: it shuffles both RDDs across the network, and that is what dominates the cost here.

Since the vector content doesn't depend on the values in otherRDD, it makes more sense to generate the vectors in place. All you have to do is mimic the RandomRDDs logic:
import org.apache.spark.mllib.random.StandardNormalGenerator
import org.apache.spark.ml.linalg.DenseVector // or org.apache.spark.mllib.linalg

val vectorSize = 42
val stdDev = 0.1
val seed = scala.util.Random.nextLong // or set manually for reproducibility

// Draw one seed per partition up front, so the output is deterministic
// no matter how the partitions are scheduled.
val random = new scala.util.Random(seed)
val seeds = (0 until otherRDD.getNumPartitions).map(
  i => i -> random.nextLong
).toMap

otherRDD.mapPartitionsWithIndex((i, iter) => {
  // One generator per partition, seeded with that partition's seed.
  val generator = new StandardNormalGenerator()
  generator.setSeed(seeds(i))
  // Pair each element with a freshly drawn, rescaled normal vector.
  iter.map(x =>
    (x, new DenseVector(Array.fill(vectorSize)(generator.nextValue() * stdDev)))
  )
})
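
For reference, a minimal usage sketch of the approach above, with a small in-memory RDD standing in for sc.textFile(otherFilepath). The sample data, the fixed seed, and the final collect are illustrative assumptions, not part of the original answer; it reuses the imports, vectorSize, and stdDev defined in the snippet above.

// A tiny stand-in for the real input; 2 partitions to exercise the per-partition seeding.
val otherRDD = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

val random = new scala.util.Random(42L) // fixed seed so the run is repeatable
val seeds = (0 until otherRDD.getNumPartitions).map(i => i -> random.nextLong).toMap

val withNormals = otherRDD.mapPartitionsWithIndex { (i, iter) =>
  val generator = new StandardNormalGenerator()
  generator.setSeed(seeds(i))
  iter.map(x => (x, new DenseVector(Array.fill(vectorSize)(generator.nextValue() * stdDev))))
}

withNormals.collect().foreach(println) // each string paired with its own random vector

Seeding per partition, rather than sharing one generator, is what makes this reproducible: as long as the input partitioning is deterministic, a given element always lands in the same partition and so always receives the same vector, which matches the behavior of the zipWithIndex-plus-join version, minus the shuffle.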