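I have the following data with id and text columns. I want to find the words in the text column that occur frequently across rows sharing the same id. I do not want to count a word that is repeated within a single row (a sentence in the text column). I tried to do this with the TF-IDF algorithm.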
id,text
1,Interface Down GigabitEthernet0/1/2 null .
1,Interface Gi0/1/2 Down on node BMAC69RT01
1,Interface Down MEth0/0/1 null .
1,Interface MEth0/0/1 Down on node
2,Interface Up FastEthernet0/0/0 null
2,Interface Fa0/0/0 Down on node
First I created tokens from the text column:
import org.apache.spark.ml.feature.Tokenizer

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
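Then I tried using CountVectorizer and IDF to get the frequent words. I think CountVectorizer is not needed here, because I do not need the term frequency within a single sentence.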
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

val countVectors = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("vectorText")
val idf = new IDF().setInputCol("vectorText").setOutputCol("features")
This gives me the following output:
|1 |(11,[0,1,2,6],[0.0,0.15415067982725836,0.3364722366212129,1.252762968495368])
|1 |(11,[0,1,2,3,4,5,8],[0.0,0.3083013596545167,0.3364722366212129,1.1192315758708453,0.5596157879354227,0.5596157879354227,1.252762968495368])
|1 |(11,[0,1,2,3],[0.0,0.15415067982725836,0.3364722366212129,0.5596157879354227])
|1 |(11,[0,1,3,4,5],[0.0,0.15415067982725836,0.5596157879354227,0.5596157879354227,0.5596157879354227])
|2 |(11,[0,2,7,9],[0.0,0.3364722366212129,1.252762968495368,1.252762968495368])
|2 |(11,[0,1,4,5,10],[0.0,0.15415067982725836,0.5596157879354227,0.5596157879354227,1.252762968495368])
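The weights in these vectors are consistent with the smoothed IDF formula documented for Spark ML's IDF, idf = ln((m + 1) / (df + 1)), with m = 6 rows, multiplied by the term's count within the row (for example 0.3083... is twice 0.1541...). The following is a small plain-Scala check of that arithmetic, not Spark code; the names `IdfCheck`, `idf`, and `docFreq` are made up for illustration.

```scala
object IdfCheck {
  // Smoothed IDF as documented for Spark ML: ln((m + 1) / (df + 1)),
  // where m is the number of rows and df the number of rows containing the term.
  def idf(numDocs: Long, docFreq: Long): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  // Invert the formula to recover how many rows contain a term from its IDF value.
  def docFreq(numDocs: Long, idfValue: Double): Long =
    math.round((numDocs + 1.0) / math.exp(idfValue) - 1.0)

  def main(args: Array[String]): Unit = {
    val numDocs = 6L // six rows in the sample data
    Seq(6L, 5L, 4L, 3L, 1L).foreach { df =>
      println(f"df=$df idf=${idf(numDocs, df)}%.6f")
    }
    // Recover the document frequency behind one of the weights shown above
    println(docFreq(numDocs, 0.5596157879354227))
  }
}
```

These document frequencies are computed over all six rows regardless of id, which is one reason TF-IDF over the whole table does not directly give per-id counts.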
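I understand that the output above gives the feature index and weight of each word. But how can I get the actual words and their frequencies from it? I would like output similar to the following. Is there any other algorithm available in Spark that would produce the output below?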
Label | (Word, Frequency)
1, | (Interface, 4) (Down, 4) (null, 2) (on, 2)
2, | (Interface, 2)
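On reading the sparse vectors: in Spark ML, fitting a CountVectorizer returns a CountVectorizerModel whose `vocabulary` array maps each vector index to its term, and a SparseVector exposes `indices` and `values`. The sketch below shows that lookup in plain Scala. The vocabulary and weights in it are hypothetical, since the real vocabulary is not shown in the question, and the decoded values are TF-IDF weights, not row counts.

```scala
object DecodeSparseVector {
  // Pair each active index of a sparse vector with its vocabulary term.
  def decode(vocabulary: Array[String],
             indices: Array[Int],
             values: Array[Double]): Seq[(String, Double)] =
    indices.zip(values).map { case (i, v) => (vocabulary(i), v) }.toSeq

  def main(args: Array[String]): Unit = {
    // Hypothetical vocabulary, for illustration only; in Spark it would come
    // from the fitted CountVectorizerModel's `vocabulary` field.
    val vocabulary = Array("interface", "down", "null", "on")
    // Hypothetical indices and weights standing in for one row's sparse vector.
    val decoded = decode(vocabulary, Array(0, 1, 3), Array(0.0, 0.15, 0.56))
    decoded.foreach { case (term, weight) => println(s"($term, $weight)") }
  }
}
```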
Answer 0 (score: 1)
I think this post may help you. Here is a way to get the desired output using a fold operation:
import scala.io.Source

Source.fromFile("fileName").getLines()
  .toList.tail // drop the header line
  .foldLeft(Map.empty[Int, Map[String, Int]]) { (counts, line) => // seed: id -> (word -> count)
    val words = line.split("\\W+") // split on non-word characters, so "Gi0/1/2" becomes "Gi0", "1", "2"
    val id = words.head.toInt // first token is the id; throws NumberFormatException if it is not numeric
    val existing = counts.getOrElse(id, Map.empty[String, Int]) // counts gathered so far for this id
    // add one for every occurrence of every word in this line
    val updated = words.tail.foldLeft(existing) { (acc, word) =>
      acc + (word -> (acc.getOrElse(word, 0) + 1))
    }
    counts + (id -> updated)
  }
  .foreach { case (id, wordCounts) =>
    println(id + " | " + wordCounts.map { case (w, c) => "(" + w + ", " + c + ")" })
  }
// 1 | List((Interface, 3), (MEth0, 2), (BMAC69RT01, 1), (null, 1), (1, 3), (on, 2), (Down, 3), (0, 2), (Gi0, 1), (2, 1), (node, 2))
// 2 | List((Interface, 2), (null, 1), (Fa0, 1), (on, 1), (Down, 1), (0, 4), (FastEthernet0, 1), (Up, 1), (node, 1))
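The fold above counts every occurrence of a token and splits interface names such as Gi0/1/2 into pieces. The question asks for something slightly different: the number of rows per id that contain a word, with repeats inside one row counted once. A minimal sketch of that variant follows, assuming whitespace tokenization and a threshold of two rows for "frequent". Neither assumption comes from the original posts, and the names `RowFrequency`, `rowFrequencies`, and `minRows` are hypothetical.

```scala
object RowFrequency {
  // For each id, count in how many rows each word appears
  // (a word repeated within one row is counted once for that row).
  def rowFrequencies(lines: Seq[String]): Map[Int, Map[String, Int]] =
    lines
      .map { line =>
        // "1,Interface Down ..." -> id and text; throws MatchError if there is no comma
        val Array(id, text) = line.split(",", 2)
        // distinct whitespace-separated words of this row
        (id.trim.toInt, text.trim.split("\\s+").filter(_.nonEmpty).toSet)
      }
      .groupBy(_._1)
      .map { case (id, rows) =>
        val counts = rows.flatMap(_._2).groupBy(identity).map { case (w, ws) => (w, ws.size) }
        (id, counts)
      }

  def main(args: Array[String]): Unit = {
    // Sample rows from the question, without the header line
    val lines = Seq(
      "1,Interface Down GigabitEthernet0/1/2 null .",
      "1,Interface Gi0/1/2 Down on node BMAC69RT01",
      "1,Interface Down MEth0/0/1 null .",
      "1,Interface MEth0/0/1 Down on node",
      "2,Interface Up FastEthernet0/0/0 null",
      "2,Interface Fa0/0/0 Down on node"
    )
    val minRows = 2 // assumed threshold for calling a word frequent
    rowFrequencies(lines).toSeq.sortBy(_._1).foreach { case (id, counts) =>
      val frequent = counts.filter(_._2 >= minRows).toSeq.sortBy { case (w, c) => (-c, w) }
      println(s"$id | " + frequent.map { case (w, c) => s"($w, $c)" }.mkString(" "))
    }
  }
}
```

On the sample rows this gives Interface and Down a count of 4 for id 1, and null and on a count of 2, along with other tokens that also occur in two rows (node, for instance). A stop list or a higher threshold would be needed to trim those further.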