How do I "normalize" vector values when using Spark CountVectorizer?

Time: 2018-02-09 20:55:51

Tags: apache-spark countvectorizer

CountVectorizer and CountVectorizerModel often create a sparse feature vector like this:

(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0])

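This basically says that the total vocabulary size is 10, that the current document contains 5 distinct elements, and that these 5 elements occupy positions 0, 1, 4, 6 and 8 of the feature vector. The element at position 0 appears twice, hence the value 2.0.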
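To make the notation concrete, here is a minimal pure-Python sketch that expands the (size, indices, values) triple into a dense list. The helper name `to_dense` is hypothetical and not part of Spark; it only mirrors how the sparse notation above is read.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size          # every position defaults to zero
    for i, v in zip(indices, values):
        dense[i] = v              # only the listed positions are non-zero
    return dense

dense = to_dense(10, [0, 1, 4, 6, 8], [2.0, 1.0, 1.0, 1.0, 1.0])
print(dense)
```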

Now I would like to "normalize" the feature vector above so that it looks like this:

(10,[0,1,4,6,8],[0.3333,0.1667,0.1667,0.1667,0.1667])

That is, each value is divided by 6, the total count of all elements (2 + 1 + 1 + 1 + 1). For example, 0.3333 = 2.0/6.
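The arithmetic can be checked with a few lines of plain Python. This is only a sketch of the desired result on the stored values, not of how Spark computes it; the variable names are chosen for illustration.

```python
indices = [0, 1, 4, 6, 8]
values = [2.0, 1.0, 1.0, 1.0, 1.0]

total = sum(values)                        # total number of elements in the document
normalized = [v / total for v in values]   # indices stay the same, only values change

print(total)
print([round(v, 4) for v in normalized])
```

Positions that are absent from `indices` are zero and stay zero, so only the stored values need to be divided.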

Is there an efficient way to do this?

Thanks!

1 answer:

Answer 0 (score: 2)

You can use Normalizer:

  

class pyspark.ml.feature.Normalizer(*args, **kwargs)

     

Normalize a vector to have unit norm using the given p-norm.
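For reference, the standard definition of the p-norm and of the normalized vector can be written as follows. The symbols $x$, $x_i$ and $\hat{x}$ are introduced here for illustration and do not appear in the quoted documentation.

```latex
\|x\|_p = \Bigl(\sum_i |x_i|^p\Bigr)^{1/p},
\qquad
\hat{x} = \frac{x}{\|x\|_p},
\qquad
\|x\|_1 = \sum_i |x_i|
```

Since term counts are non-negative, the 1-norm of a count vector is simply the sum of its counts, which for the vector in the question is 6.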

with the 1-norm (p=1):

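from pyspark.sql import SparkSession
from pyspark.ml.linalg import SparseVector
from pyspark.ml.feature import Normalizer

# Reuse the active session (e.g. in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()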

df = spark.createDataFrame([
    (SparseVector(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0]), )
], ["features"])

Normalizer(inputCol="features", outputCol="features_norm", p=1).transform(df).show(1, False)
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
# |features                              |features_norm                                                                                                        |
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
# |(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0])|(10,[0,1,4,6,8],[0.3333333333333333,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666])|
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+