Question

到目前为止，我已经能够对所有文档进行标记，并使用Spark的MLLib中的CountVectorizer和IDF。我试图从每个文档中获取前50个单词，但我不确定如何对IDF的输出进行排序。

onePer是文档ID和标记化文档的数据框。

val tf = new CountVectorizer()
  .setInputCol("text")
  .setOutputCol("features").fit(onePer)
  .transform(onePer).select("features").rdd
    .map{x:Row => x.getAs[Vector](0)}

tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

这就是我的输出的样子（词汇中的单词数，单词的id，单词分数）。我想按分数排序并得到最高的k：

(440,[0,2,3,4,5,6,7,8,9,10,12,15,17,18,19,22,23,24,25,26,27,28,30,31,32,33,34,35,39,41,43,45,47,49,51,52,53,55,57,63,66,69,70,71,74,76,79,80,83,84,85,88,94,95,96,97,99,102,106,107,109,111,117,120,121,124,127,128,129,138,142,145,146,149,154,156,164,166,167,170,171,176,187,189,199,203,204,217,218,219,232,234,236,237,238,240,248,250,251,254,259,263,265,267,280,291,296,302,304,309,319,322,328,333,347,361,364,371,375,384,388,393,395,401,403,433,438,439],[1.3559553712291716,3.9422868018213513,0.6369074622370692,7.795697904781566,3.153829441457081,0.0,5.519201522549892,0.3184537311185346,0.3184537311185346,1.3559553712291716,0.4519851237430572,0.4519851237430572,0.6061358035703155,1.0116009116784799,0.4519851237430572,0.7884573603642703,0.4519851237430572,2.0232018233569597,0.7884573603642703,8.523740461192126,0.6061358035703155,0.6061358035703155,0.6061358035703155,0.6061358035703155,0.7884573603642703,0.6061358035703155,0.6061358035703155,0.6061358035703155,0.7884573603642703,0.7884573603642703,1.0116009116784799,1.0116009116784799,2.0232018233569597,0.7884573603642703,0.7884573603642703,3.897848952390783,0.7884573603642703,0.7884573603642703,1.0116009116784799,5.114244276715276,1.0116009116784799,1.0116009116784799,2.5985659682605218,1.2992829841302609,1.2992829841302609,1.0116009116784799,1.0116009116784799,1.0116009116784799,1.0116009116784799,1.0116009116784799,2.5985659682605218,1.0116009116784799,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,3.4094961844768505,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,3.4094961844768505,1.2992829841302609,1.2992829841302609,1.2992829841302609,3.4094961844768505,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253])

更新

我能够通过以下方式实现这一目标：

tfidf.map(x => x.toSparse).map{x => x.indices.zip(x.values)
  .sortBy(-_._2)
  .take(10)
  .map(_._1)
}

Answer 1

这可能会有所帮助：

scala> val x = (440,Array[Int](0,2,3,4,5,6,7,8,9,10,12,15,17,18,19,22,23,24,25,26,27,28,30,31,32,33,34,35,39,41,43,45,47,49,51,52,53,55,57,63,66,69,70,71,74,76,79,80,83,84,85,88,94,95,96,97,99,102,106,107,109,111,117,120,121,124,127,128,129,138,142,145,146,149,154,156,164,166,167,170,171,176,187,189,199,203,204,217,218,219,232,234,236,237,238,240,248,250,251,254,259,263,265,267,280,291,296,302,304,309,319,322,328,333,347,361,364,371,375,384,388,393,395,401,403,433,438,439),Array[Double](1.3559553712291716,3.9422868018213513,0.6369074622370692,7.795697904781566,3.153829441457081,0.0,5.519201522549892,0.3184537311185346,0.3184537311185346,1.3559553712291716,0.4519851237430572,0.4519851237430572,0.6061358035703155,1.0116009116784799,0.4519851237430572,0.7884573603642703,0.4519851237430572,2.0232018233569597,0.7884573603642703,8.523740461192126,0.6061358035703155,0.6061358035703155,0.6061358035703155,0.6061358035703155,0.7884573603642703,0.6061358035703155,0.6061358035703155,0.6061358035703155,0.7884573603642703,0.7884573603642703,1.0116009116784799,1.0116009116784799,2.0232018233569597,0.7884573603642703,0.7884573603642703,3.897848952390783,0.7884573603642703,0.7884573603642703,1.0116009116784799,5.114244276715276,1.0116009116784799,1.0116009116784799,2.5985659682605218,1.2992829841302609,1.2992829841302609,1.0116009116784799,1.0116009116784799,1.0116009116784799,1.0116009116784799,1.0116009116784799,2.5985659682605218,1.0116009116784799,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,3.4094961844768505,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,3.4094961844768505,1.2992829841302609,1.2992829841302609,1.2992829841302609,3.4094961844768505,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253))

scala> val (r, indices, values) = x
r: Int = 440
indices: Array[Int] = Array(0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 17, 18, 19, 22, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, 34, 35, 39, 41, 43, 45, 47, 49, 51, 52, 53, 55, 57, 63, 66, 69, 70, 71, 74, 76, 79, 80, 83, 84, 85, 88, 94, 95, 96, 97, 99, 102, 106, 107, 109, 111, 117, 120, 121, 124, 127, 128, 129, 138, 142, 145, 146, 149, 154, 156, 164, 166, 167, 170, 171, 176, 187, 189, 199, 203, 204, 217, 218, 219, 232, 234, 236, 237, 238, 240, 248, 250, 251, 254, 259, 263, 265, 267, 280, 291, 296, 302, 304, 309, 319, 322, 328, 333, 347, 361, 364, 371, 375, 384, 388, 393, 395, 401, 403, 433, 438, 439)
values: Array[Double] = Array(1.3559553712291716, 3.9422868018213513, 0.6369074622370692, 7.795697904781566, 3.153829441457081, 0.0, 5.519201522549892, 0.3184537311185346, 0.31845373...

scala> val topTermIds = indices.zip(values).sortBy( - _._2).take(50).map(_._1)
topTermIds: Array[Int] = Array(26, 4, 7, 63, 2, 52, 109, 124, 138, 5, 70, 85, 24, 47, 176, 187, 189, 199, 203, 204, 217, 218, 219, 232, 234, 236, 237, 238, 240, 248, 250, 251, 254, 259, 263, 265, 267, 280, 291, 296, 302, 304, 309, 319, 322, 328, 333, 347, 361, 364)

现在您需要将上面的代码插入到闭包中，例如：

val topTermsByScore = rdd.map { v: Vector =>
    // to sort decreasing use - 
    v.indices.zip(v.values).sortBy( - _._2).take(50).map(_._1)
}

Spark Scala TF-IDF值排序向量

更新

1 个答案: