How to get the words corresponding to the highest tf-idf values using PySpark?

Asked: 2018-10-10 21:29:48

Tags: python pyspark tf-idf

I have seen similar posts, but none with a complete answer, hence posting here.

I am using TF-IDF in Spark to get the words with the highest tf-idf values in a document. I use the following code.

from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, StopWordsRemover

tokenizer = Tokenizer(inputCol="doc_cln", outputCol="tokens")
remover1 = StopWordsRemover(inputCol="tokens",
                            outputCol="stopWordsRemovedTokens")

stopwordList = ["word1", "word2", "word3"]

remover2 = StopWordsRemover(inputCol="stopWordsRemovedTokens",
                            outputCol="filtered", stopWords=stopwordList)

hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=2000)

idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[tokenizer, remover1, remover2, hashingTF, idf])

model = pipeline.fit(df)

results = model.transform(df)
results.cache()

I get a result like

|[a8g4i9g5y, hwcdn] |(2000,[905,1104],[7.34977707433047,7.076179741760428])

where

filtered: array (nullable = true)
features: vector (nullable = true)

How do I extract the array from "features"? Ideally, I would like to get the word corresponding to the highest tf-idf, like this:

|a8g4i9g5y|7.34977707433047

Thanks!

1 answer:

Answer 0 (score: 1)

  1. The data type of your feature column is vector from the pyspark.ml.linalg package. It can be either

    1. pyspark.ml.linalg.DenseVector (source), e.g. DenseVector([1., 2.])
    2. pyspark.ml.linalg.SparseVector (source), e.g. SparseVector(4, [1, 3], [3.0, 4.0])
  2. Based on the data you have, (2000,[905,1104],[7.34977707433047,7.076179741760428]), it is clearly a SparseVector, which can be broken down into 3 main components:

    • size: 2000
    • indices: [905,1104]
    • values: [7.34977707433047,7.076179741760428]
  3. What you are looking for is the values attribute of that vector (a standalone sketch of these attributes, and of picking the top weight per document, follows the code below).

  4. With other "literal" PySpark SQL types, such as StringType or IntegerType, you can access their attributes (and aggregate functions) using the SQL functions package (docs). But vector is not a literal SQL type, and the only way to access its attributes is through a UDF, for example:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, DoubleType

    # Important: `vector.values` returns a numpy ndarray.
    # PySpark doesn't understand ndarray, therefore you'd want to
    # convert it to a normal Python list using `tolist`
    def extract_values_from_vector(vector):
        return vector.values.tolist()

    # Wrap it in a regular UDF and apply it to the given column
    def extract_values_from_vector_udf(col):
        return udf(extract_values_from_vector, ArrayType(DoubleType()))(col)

    # And use that UDF to get your values
    results.select(extract_values_from_vector_udf('features'), 'features')
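
To make the vector anatomy from points 1-3 concrete, here is a minimal standalone sketch (no DataFrame needed) that rebuilds the SparseVector from the question and reads off its size, indices and values; the variable name sv is just for illustration:

    from pyspark.ml.linalg import SparseVector

    # The sparse vector shown in the question:
    # size 2000, with non-zero entries at hashed feature indices 905 and 1104
    sv = SparseVector(2000, [905, 1104], [7.34977707433047, 7.076179741760428])

    print(sv.size)             # 2000
    print(sv.indices)          # the non-zero positions: [905, 1104] (a numpy array)
    print(sv.values)           # the tf-idf weights (a numpy ndarray)
    print(sv.values.tolist())  # plain Python list: [7.34977707433047, 7.076179741760428]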
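
Going one step further toward the original goal (the highest tf-idf weight per document), the same UDF idea can return the (index, value) pair of the maximum instead of the whole list. This is only a sketch, not part of the original answer: the helper extract_top_from_vector and the column name top_tfidf are made up for illustration. Also note that HashingTF cannot map an index back to a word; to recover the actual term you would typically replace HashingTF with CountVectorizer and look the index up in the fitted model's vocabulary list.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

    # Hypothetical helper: return the (feature index, tf-idf value) pair
    # with the largest weight in the vector
    def extract_top_from_vector(vector):
        i = int(vector.values.argmax())  # position of the max among the non-zero entries
        return int(vector.indices[i]), float(vector.values[i])

    top_schema = StructType([
        StructField("index", IntegerType()),
        StructField("tfidf", DoubleType()),
    ])

    extract_top_udf = udf(extract_top_from_vector, top_schema)

    # One row per document: the hashed feature index and its highest tf-idf weight
    results.select("filtered", extract_top_udf("features").alias("top_tfidf")).show(truncate=False)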