How to apply "Sum(vi * ln(vi))" to each row of an org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)]

Time: 2017-02-25 06:59:18

Tags: scala apache-spark apache-spark-mllib

I have an RDD with this structure:

org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)]

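Each row of the RDD contains an index of type Long and a vector of type org.apache.spark.mllib.linalg.Vector. I want to apply the following function to every vector in the RDD.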

The function is Sum(vi * ln(vi)), where vi is the i-th component of the vector.

Please advise how to apply this function, in Scala, to an RDD with the structure described above.

A sample row looks like this:

Array[(Long, org.apache.spark.mllib.linalg.Vector)] = 
      Array((0,[0.024866109194373365,0.025451635045582396,0.024940244042347803,
                0.025318245591768037,0.026531498776299952,0.02335951025503321,
                0.02388238099930112,0.023397342214386187,0.024965559145567116,
                0.023650490684903713,0.023343404489401316,0.024368157919182634,
                0.02526665811061871,0.025846888476461573,0.025874255477319974))

1 Answer:

Answer 0: (score: 1)

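You can convert the Vector in each row to an Array with toArray, map x * log(x) over every element, and then sum the resulting Array in a second mapValues call. The result is an RDD of (Long, Double) pairs, one sum per row: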

rdd.mapValues(_.toArray.map(x => scala.math.log(x) * x)).mapValues(_.sum)
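One edge case worth noting: if a vector component is exactly 0, scala.math.log(0.0) is -Infinity and 0.0 * -Infinity evaluates to NaN, which would turn the whole row's sum into NaN. The sketch below is one possible way to guard against that. The object XLogX and the helper sumXLogX are hypothetical names introduced here for illustration, not part of Spark, and treating a zero component as contributing 0 is an assumption (the usual entropy convention), not something stated in the question.

```scala
// Hypothetical helper, not part of Spark or MLlib.
object XLogX {
  // Computes Sum(v_i * ln(v_i)) over the components of one vector.
  // Assumption: a component equal to 0 contributes 0, because
  // 0.0 * math.log(0.0) would otherwise evaluate to NaN.
  def sumXLogX(values: Array[Double]): Double =
    values.iterator.map(x => if (x == 0.0) 0.0 else x * math.log(x)).sum

  // Local check on plain arrays; no Spark context is needed here.
  def main(args: Array[String]): Unit = {
    val uniform = Array(0.25, 0.25, 0.25, 0.25)
    println(sumXLogX(uniform))          // ln(0.25), up to floating-point rounding
    println(sumXLogX(Array(1.0, 0.0)))  // the zero component is skipped
  }
}
```

With the RDD from the question, the helper would be plugged in as rdd.mapValues(v => XLogX.sumXLogX(v.toArray)), which does the conversion and the sum in a single mapValues pass instead of two.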