How can I compute the Manhattan distance metric using a DataFrame instead of an RDD?

Time: 2018-06-29 13:31:44

Tags: scala apache-spark

I have written the following code using a Spark RDD:

val result = data.map { line =>
  val eachRecord = line.split(delimiter)
  // Keep only the columns that take part in the metric, converted to doubles
  val filterRec = eachRecord.indices
    .filter(selectedIndex.contains)
    .map(i => eachRecord(i).toDouble)
    .toArray
  // Manhattan distance: sum of absolute differences from the centroid
  val distance = cendroid.zip(filterRec)
    .foldLeft(0.0) { case (sum, (v1, v2)) => sum + Math.abs(v1 - v2) }
  eachRecord.mkString(delimiter) + delimiter + distance
}
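For context, a minimal setup under which the snippet above runs might look like this; all concrete values below (the file path, the selected indices, the centroid) are assumptions for illustration, not taken from the question:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("manhattan-rdd").setMaster("local[*]"))

    val delimiter     = ","
    val selectedIndex = Set(0, 1, 2, 3)            // hypothetical: c1..c4 take part in the metric
    val cendroid      = Array(5.0, 3.4, 1.4, 0.2)  // hypothetical centroid, one value per selected column

    // RDD[String]; the header line must be stripped before the map above runs
    val raw    = sc.textFile("data.csv")
    val header = raw.first()
    val data   = raw.filter(_ != header)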

Here, data is an RDD[String], and selectedIndex contains the indices of the columns that belong to this metric. How can I do the same with a DataFrame? I am currently using Spark 2.3.1.

Sample data:

   c1,c2,c3,c4,c5
   5.1,3.5,1.4,0.2,0
   4.9,3,1.4,0.2,0
   4.7,3.2,1.3,0.2,0
   4.6,3.1,1.5,0.2,0 
   5,3.6,1.4,0.2,0
   5.4,3.9,1.7,0.4,0
   4.6,3.4,1.4,0.3,0  
   5,3.4,1.5,0.2,0
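For reference, a minimal sketch of the same computation on a DataFrame might look like the following; the column names come from the sample header, while the centroid values are placeholders, not taken from the question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{abs, col, lit}

    val spark = SparkSession.builder().appName("manhattan-df").master("local[*]").getOrCreate()

    // Read the sample CSV; the header supplies the column names c1..c5
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data.csv")

    // Columns that take part in the metric, paired with an assumed centroid
    val selectedCols = Seq("c1", "c2", "c3", "c4")
    val centroid     = Seq(5.0, 3.4, 1.4, 0.2)

    // Manhattan distance: sum of |column - centroid component| over the selected columns
    val manhattan = selectedCols.zip(centroid)
      .map { case (c, v) => abs(col(c) - lit(v)) }
      .reduce(_ + _)

    val result = df.withColumn("distance", manhattan)
    result.show()

Building the distance as a single Column expression keeps the whole computation inside Catalyst, so no round-trip through an RDD or a UDF is needed.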

0 Answers:

No answers yet.