How do I sort a Vector by value in SparkML?

Asked: 2016-11-17 03:52:06

Tags: apache-spark spark-dataframe

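I computed some feature vectors with the tf-idf algorithm in SparkML. Now I want to get the largest value in each Vector. How can I sort a Vector by value, or get its maximum?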

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.linalg.Vector

// Load the reviews; `spark` is the SparkSession (e.g. the one provided by spark-shell)
val testDF = spark.read.json("/dataset/yelp_review_test.json")
// Split the review text into words
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(testDF)
//wordsData.show()
// Hash the words into term-frequency vectors
val hashTF = new HashingTF().setInputCol("words").setOutputCol("tfFeatures")
val tfFeatures = hashTF.transform(wordsData)
//tfFeatures.select("review_id","words","tfFeatures").foreach(println(_))
// Rescale the term frequencies by inverse document frequency
val idf = new IDF().setInputCol("tfFeatures").setOutputCol("idfFeatures")

val idfModel = idf.fit(tfFeatures)
val allDF = idfModel.transform(tfFeatures)
allDF.show()


A row vector in idfFeatures looks like this:

(262144,[7617,24417,36200,61231,65069,66865,95805,103838,117481,138356,142373,151536,161061,189683,200556,204852,205044,218917,222453,227410,232735,235447],[2.1972245773362196,0.1823215567939546,1.5040773967762742,0.49247648509779424,1.791759469228055,1.2809338454620642,1.2809338454620642,0.0,1.791759469228055,1.0986122886681098,2.1972245773362196,0.8109302162163288,2.1972245773362196,0.25131442828090617,2.1972245773362196,2.1972245773362196,0.4054651081081644,1.791759469228055,1.888923217681703,0.0,2.1972245773362196,2.1972245773362196])

1 Answer:

Answer 0 (score: 0)

Since it is a SparkML Vector, you can convert it to an ordinary collection and use the functions available there to find the maximum, like this:

myVector.toArray.max

Or, for example when replacing the vector inside a tuple, with an explicit reduce:

myVector.toArray.reduce( (a, b) => if (a > b) a else b )
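
The question also asks about sorting, which the snippets above do not cover. Below is a minimal sketch of one way to do it, assuming Spark 2.x or later with `org.apache.spark.ml.linalg` on the classpath. The object name `VectorSortSketch`, the helper `sortedEntries`, and the small sample vector are hypothetical and only serve the illustration. For a `SparseVector` it reads the stored `indices` and `values` directly, so a 262144-dimensional vector is not expanded into a dense array the way `toArray` would expand it.

```scala
import org.apache.spark.ml.linalg.{SparseVector, Vector, Vectors}

object VectorSortSketch {
  // Returns (index, value) pairs sorted by value, largest first.
  // For a SparseVector only the explicitly stored entries are returned;
  // the implicit zeros are skipped. Any other Vector is read via toArray.
  def sortedEntries(v: Vector): Seq[(Int, Double)] = {
    val entries: Seq[(Int, Double)] = v match {
      case sv: SparseVector => sv.indices.zip(sv.values).toSeq
      case other            => other.toArray.zipWithIndex.map(_.swap).toSeq
    }
    entries.sortWith((a, b) => a._2 > b._2)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical small vector standing in for one idfFeatures row.
    val v = Vectors.sparse(10, Array(1, 4, 7), Array(0.5, 2.0, 1.25))
    println(sortedEntries(v)) // entries ordered by descending value
    println(v.toArray.max)    // maximum value, as in the answer above
    println(v.argmax)         // index of the maximum value
  }
}
```

Because the implicit zeros are skipped, the first sorted entry equals the vector's maximum only when at least one stored value is non-negative, which holds for tf-idf weights. To apply this to every row of `allDF`, the helper could presumably be wrapped in a `udf` from `org.apache.spark.sql.functions` and applied to the `idfFeatures` column.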