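How to get a Vector out of a DataFrame

Asked: 2016-11-16 07:35:10

Tags: scala apache-spark

I used the Spark ML TF-IDF algorithm to compute some feature vectors. Now I want to extract the Vector stored in the "idfFeatures" column.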

My code is:

val vectors = allDF.select("idfFeatures").map{
  case Row(vector: Vector) =>
    vector
}
vectors.foreach(println(_))

The console shows an error:

Error:(38, 24) type Vector takes type parameters
  case Row(vector: Vector) =>
                   ^

If I change Vector to String, I get a different error.

How can I get the Vector?

1 answer:

Answer 0 (score: 1)

Spark 1.x:

import org.apache.spark.mllib.linalg.Vector

Spark 2.0:

import org.apache.spark.ml.linalg.Vector
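
The compile error in the question is consistent with the unqualified name Vector resolving to Scala's own generic collection, scala.collection.immutable.Vector[A], which is in scope by default and cannot be used as a type without a type argument. The sketch below is plain Scala with no Spark dependency; VectorNameDemo and describe are hypothetical names used only for illustration.

```scala
// With no import, `Vector` is Scala's generic collection type,
// so it needs a type argument wherever it is used as a type.
object VectorNameDemo {
  val xs: Vector[Int] = Vector(1, 2, 3)

  // Hypothetical helper, for illustration only.
  // Writing `case _: Vector =>` here (no type argument) fails to compile
  // with "type Vector takes type parameters".
  def describe(x: Any): String = x match {
    case _: Vector[_] => "scala.collection.immutable.Vector"
    case _            => "something else"
  }
}
```

Importing Spark's Vector shadows the collection type for the rest of the scope. If both are needed in one file, a renaming import such as import org.apache.spark.ml.linalg.{Vector => MLVector} keeps them apart (MLVector is an arbitrary alias).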

Example:

// https://spark.apache.org/docs/latest/ml-features.html#tf-idf

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Go through the underlying RDD and match each Row against Spark's ML Vector type
rescaledData.select("features").rdd.map { case Row(v: Vector) => v }.first
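
The same pattern can be applied to the column name from the question. The following is a minimal sketch for Spark 2.x, not a definitive implementation: the object ExtractVectors, the helpers collectVectors and collectArrays, and the tiny hand-built allDF are all assumptions made for illustration. Going through .rdd avoids the implicit Encoder that Dataset.map requires for its result type.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object ExtractVectors {
  // Hypothetical helper: collect every vector in `column` to the driver
  // by pattern matching each single-column Row.
  def collectVectors(df: DataFrame, column: String): Array[Vector] =
    df.select(column).rdd.map { case Row(v: Vector) => v }.collect()

  // Same idea without pattern matching: read field 0 with getAs,
  // then convert each vector to a dense Array[Double].
  def collectArrays(df: DataFrame, column: String): Array[Array[Double]] =
    df.select(column).rdd.map(_.getAs[Vector](0).toArray).collect()

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]").appName("extract-vectors").getOrCreate()

    // Stand-in for the question's allDF (assumed shape: one vector column).
    val allDF = spark.createDataFrame(Seq(
      (0, Vectors.dense(0.0, 1.5, 0.0)),
      (1, Vectors.sparse(3, Array(0, 2), Array(2.0, 0.5)))
    )).toDF("label", "idfFeatures")

    collectVectors(allDF, "idfFeatures").foreach(println)
    collectArrays(allDF, "idfFeatures").foreach(a => println(a.mkString(", ")))

    spark.stop()
  }
}
```

Both helpers call collect(), which pulls every row to the driver, so on a large DataFrame it is usually safer to keep working on the RDD or to use first or take(n) as in the example above.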