I have already written this with the Spark RDD API, for example:
val result = data.map { line =>
  val fields = line.split(delimiter)
  // Keep the columns listed in selectedIndex, in record order, as doubles.
  // (Filtering by index directly avoids the empty-string placeholder trick,
  // which could silently drop a genuinely empty field and misalign the zip.)
  val selected = fields.indices
    .filter(selectedIndex.contains)
    .map(i => fields(i).toDouble)
  // Manhattan (L1) distance between the centroid and the selected values
  val dist = cendroid.zip(selected).foldLeft(0.0) {
    case (sum, (v1, v2)) => sum + math.abs(v1 - v2)
  }
  // Append the distance as one more delimited field
  fields.mkString(delimiter) + delimiter + dist
}
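The names delimiter, selectedIndex, and cendroid are defined elsewhere in my job; to make the snippet concrete against the sample below, hypothetical bindings could look like this:

val delimiter = ","                       // field separator in the raw lines
val selectedIndex = Set(0, 1, 2, 3)       // indices of c1..c4, the metric columns
val cendroid = Array(5.0, 3.4, 1.5, 0.2)  // hypothetical centroid, one value per selected column
// With these values the first sample row "5.1,3.5,1.4,0.2,0" becomes
// "5.1,3.5,1.4,0.2,0,0.3", since |5.1-5.0| + |3.5-3.4| + |1.4-1.5| + |0.2-0.2| ≈ 0.3
// (up to floating-point rounding).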
Here data is an RDD[String] and selectedIndex holds the indices of the columns that take part in the metric. How can I do the same thing with a DataFrame? I am currently using Spark 2.3.1. (A rough sketch of what I have in mind follows the sample below.)
Data sample:
c1,c2,c3,c4,c5
5.1,3.5,1.4,0.2,0
4.9,3,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
5,3.6,1.4,0.2,0
5.4,3.9,1.7,0.4,0
4.6,3.4,1.4,0.3,0
5,3.4,1.5,0.2,0
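For reference, here is a minimal sketch of the DataFrame version I am imagining on Spark 2.3.1, built from plain column expressions so no UDF should be needed. The path, selectedCols, and the literal cendroid values are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, col, lit}

val spark = SparkSession.builder().appName("l1-distance").getOrCreate()

// Read the sample CSV; inferSchema makes c1..c5 numeric
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data.csv")                              // hypothetical path

val selectedCols = Seq("c1", "c2", "c3", "c4")  // columns used in the metric
val cendroid = Seq(5.0, 3.4, 1.5, 0.2)          // hypothetical centroid, aligned with selectedCols

// Build sum(|col_i - cendroid_i|) as a single column expression
val l1 = selectedCols.zip(cendroid)
  .map { case (c, v) => abs(col(c) - lit(v)) }
  .reduce(_ + _)

val result = df.withColumn("distance", l1)
result.show()

Is this the right approach, or is there a more idiomatic way to do it?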