How do I convert org.apache.spark.mllib.linalg.SparseVector to org.apache.spark.ml.linalg.SparseVector?
I am migrating my code from the mllib API to the ml API.
import org.apache.spark.mllib.linalg.{DenseVector, Vector}
import org.apache.spark.ml.linalg.{DenseVector => NewDenseVector, Vector => NewVector}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.{LabeledPoint => NewLabeledPoint}
val labelPointData = limitedTable.rdd.map { row =>
new NewLabeledPoint(convertToDouble(row.head), row(1).asInstanceOf[org.apache.spark.ml.linalg.SparseVector])
}
The expression row(1).asInstanceOf[org.apache.spark.ml.linalg.SparseVector] fails with the following exception:
org.apache.spark.mllib.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.SparseVector
How can I get around this?
I have found code that converts from mllib to ml, but not the other way around.
Answer 0 (score: 5)
Conversion is possible in both directions. First, let's create an mllib SparseVector:
import org.apache.spark.mllib.linalg.Vectors
// Indices must be smaller than the vector size, so a size-3 vector uses indices 0 to 2.
val mllibVec: org.apache.spark.mllib.linalg.Vector = Vectors.sparse(3, Array(0, 1, 2), Array(1.0, 2.0, 3.0))
To convert it to an ml SparseVector, simply call asML:
val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
To convert it back, the simplest way is Vectors.fromML():
val mllibVec2: org.apache.spark.mllib.linalg.Vector = Vectors.fromML(mlVec)
Additionally, in your code you could use row.getAs[SparseVector](1) instead of row(1).asInstanceOf[SparseVector]. Read the value as an mllib vector, convert it with asML, and pass the result to the ml-based LabeledPoint, i.e.:
val labelPointData = limitedTable.rdd.map { row =>
NewLabeledPoint(convertToDouble(row.head), row.getAs[org.apache.spark.mllib.linalg.SparseVector](1).asML)
}
Answer 1 (score: 0)
In pyspark, you can convert between the different vector types like this:
from pyspark.mllib.linalg import Vectors as mllib_vectors
from pyspark.ml.linalg import Vectors as ml_vectors
import numpy as np
# Construct dense vectors in mllib and ml
v1 = mllib_vectors.dense([1.0, 1.0, 0, 0, 0])
v2 = ml_vectors.dense([1.0, 1.0, 0, 0, 0])
print('v1: %s' % v1)
print('v2: %s' % v2)
print(v1 == v2)
print(type(v1), type(v2))
# Convert vector to numpy array
arr1 = v1.toArray()
print('arr1: %s type: %s' % (arr1, type(arr1)))
# convert mllib vectors to ml vectors
v3 = ml_vectors.dense(arr1)
print('v3: %s' % v3)
print(type(v3))
# Convert ml dense vector to sparse vector
arr2 = v2.toArray()
print('arr2', arr2)
d = {i:arr2[i] for i in np.nonzero(arr2)[0]}
print('d', d)
v4 = ml_vectors.sparse(len(arr2), d)
print('v4: %s' % v4)
# Convert ml sparse vector to dense vector
v5 = ml_vectors.dense(v4.toArray())
print('v5: %s' % v5)
# Convert mllib dense vector to mllib sparse vector
d1 = {i: arr1[i] for i in np.nonzero(arr1)[0]}
v6 = mllib_vectors.sparse(len(arr1), d1)
print('v6: %s' % v6)
# Convert ml sparse vector to mllib sparse vector
arr3 = v4.toArray()
d = {i:arr3[i] for i in np.nonzero(arr3)[0]}
v7 = mllib_vectors.sparse(len(arr3), d)
print('v7: %s' % v7)
Output:
v1: [1.0,1.0,0.0,0.0,0.0]
v2: [1.0,1.0,0.0,0.0,0.0]
False
<class 'pyspark.mllib.linalg.DenseVector'> <class 'pyspark.ml.linalg.DenseVector'>
arr1: [1. 1. 0. 0. 0.] type: <class 'numpy.ndarray'>
v3: [1.0,1.0,0.0,0.0,0.0]
<class 'pyspark.ml.linalg.DenseVector'>
arr2 [1. 1. 0. 0. 0.]
d {0: 1.0, 1: 1.0}
v4: (5,[0,1],[1.0,1.0])
v5: [1.0,1.0,0.0,0.0,0.0]
v6: (5,[0,1],[1.0,1.0])
v7: (5,[0,1],[1.0,1.0])