Question

Given my pyspark Row object:

>>> row
Row(clicked=0, features=SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}))
>>> row.clicked
0
>>> row.features
SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})
>>> type(row.features)
<class 'pyspark.ml.linalg.SparseVector'>

However, row.features fails the isinstance(row.features, Vector) test.

>>> isinstance(SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}), Vector)
True
>>> isinstance(row.features, Vector)
False
>>> isinstance(deepcopy(row.features), Vector)
False

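This strange error has gotten me into big trouble. Unless isinstance(row.features, Vector) passes, I cannot generate LabeledPoint objects with the map function. I would be really grateful if anyone could solve this problem.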

Answer 1

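This is unlikely to be a bug. You did not provide the code required to reproduce the issue, but most likely you are using Spark 2.0 with ML transformers and comparing against the wrong class.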

Let's illustrate this with an example. Some simple data:

# Assumes a Spark 2.x PySpark shell where `sc` (the SparkContext) is already defined.
from pyspark.ml.feature import OneHotEncoder

row = OneHotEncoder(inputCol="x", outputCol="features").transform(
    sc.parallelize([(1.0, )]).toDF(["x"])
).first()

Now let's import the different vector classes:

from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint

and run the tests:

isinstance(row.features, MLLibVector)

False

isinstance(row.features, MLVector)

True

As you can see, what we have is a pyspark.ml.linalg.Vector, not a pyspark.mllib.linalg.Vector, and it is not compatible with the old API:

LabeledPoint(0.0, row.features)

TypeError                                 Traceback (most recent call last)
...
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
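
When two packages expose classes with the same short name, printing the module-qualified name usually makes the mismatch obvious. Below is a minimal sketch that needs no Spark at all: `qualified_name` is a hypothetical helper, and the two toy `Vector` classes (with made-up module names `toy_ml.linalg` and `toy_mllib.linalg`) only stand in for the real pyspark.ml and pyspark.mllib classes.

```python
def qualified_name(obj):
    """Return the module-qualified class name of obj, e.g. 'builtins.int'."""
    cls = type(obj)
    return "{0}.{1}".format(cls.__module__, cls.__qualname__)


# Toy stand-ins (hypothetical): two unrelated classes that share the name "Vector".
MLVectorToy = type("Vector", (), {"__module__": "toy_ml.linalg"})
MLLibVectorToy = type("Vector", (), {"__module__": "toy_mllib.linalg"})

v = MLVectorToy()
print(qualified_name(v))               # toy_ml.linalg.Vector
print(isinstance(v, MLVectorToy))      # True
print(isinstance(v, MLLibVectorToy))   # False: same short name, different class
```

Calling the same helper on row.features and on the Vector class you imported should show whether they come from the same package.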

You could convert the ML object to an MLLib one:

from pyspark.ml import linalg as ml_linalg

def as_mllib(v):
    if isinstance(v, ml_linalg.SparseVector):
        return MLLibVectors.sparse(v.size, v.indices, v.values)
    elif isinstance(v, ml_linalg.DenseVector):
        return MLLibVectors.dense(v.toArray())
    else:
        raise TypeError("Unsupported type: {0}".format(type(v)))

LabeledPoint(0, as_mllib(row.features))

LabeledPoint(0.0, (1,[],[]))

or simply:

LabeledPoint(0, MLLibVectors.fromML(row.features))

LabeledPoint(0.0, (1,[],[]))

But generally speaking, you should avoid this situation where possible.

Answer 2

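If you just want to convert SparseVectors from pyspark.ml into pyspark.mllib SparseVectors, you can use MLUtils. Assume df is your DataFrame and the column holding the SparseVectors is named "features". Then the following lines do the job: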

from pyspark.mllib.utils import MLUtils
df = MLUtils.convertVectorColumnsFromML(df, "features")

I ran into this problem because, when using CountVectorizer from pyspark.ml.feature, I could not create LabeledPoints: they are incompatible with the SparseVector from pyspark.ml.

I wonder why their latest CountVectorizer documentation does not use the "new" SparseVector class. Since the classification algorithms need LabeledPoints, this makes no sense to me...

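UPDATE: I had misunderstood that the ml library is designed for DataFrame objects, while the mllib library is designed for RDD objects. The DataFrame data structure is the recommended one since Spark > 2.0, because SparkSession is more compatible than SparkContext (it still stores a SparkContext object) and delivers DataFrames instead of RDDs. I found this post that gave me the "aha" effect: mllib and ml. Thanks Alberto Bonsanto :).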

To use, for example, NaiveBayes from mllib, I had to transform my DataFrame into LabeledPoint objects for it.

But it is much easier to use NaiveBayes from ml, because you do not need LabeledPoints at all: you just specify the feature and class columns of your DataFrame.

PS: I struggled with this problem for hours, so I felt I needed to post it here :)
