Question

我试图找到短信之间的相似性（大约100万条短信），在我的实现中，每一行代表一个条目。

为了计算这些文本之间的相似性，我们采用 tfidf 和 columnSimilarities

以下是代码：

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRow
//import scala.math.Ordering
//import org.apache.spark.RangePartitioner
def transposeRowMatrix(m: RowMatrix): RowMatrix = {
  val indexedRM = new IndexedRowMatrix(m.rows.zipWithIndex.map({
    case (row, idx) => IndexedRow(idx, row)}))
  val transposed = indexedRM.toCoordinateMatrix().transpose.toIndexedRowMatrix()
  new RowMatrix(transposed.rows
    .map(idxRow => (idxRow.index, idxRow.vector))
    .sortByKey().map(_._2))
}
// split word based on apce and specila characters
val documents = sc.textFile("./test1").map(_.split(" |\\,|\\?|\\-|\\+|\\*|\\(|\\)|\\[|\\]|\\{|\\}|\\<|\\>|\\/|\\;|\\.|\\:|\\=|\\^|\\|").filter(_.nonEmpty).toSeq)
val hashingTF = new HashingTF()
val tf = hashingTF.transform(documents)
tf.cache()
println(tf.getNumPartitions)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)
val mat = new RowMatrix(tfidf)
// transpose matrix to get result between row (not between column)
val sim = transposeRowMatrix(mat).columnSimilarities()
val trdd = sim.entries.map{case MatrixEntry(row: Long, col:Long, sim:Double) => Array(row,col,sim).mkString(",")}
println(trdd.getNumPartitions)
// to descease write to file time
val transformedRDD = trdd.repartition(50)
println(transformedRDD.getNumPartitions)
transformedRDD.cache()
ransformedRDD.saveAsTextFile("output")*

问题是当文件中类似消息的数量增加时，相似性会降低。

e.g。

我们假设我们有以下文件：

hello world
hello world 123
how is every thing
we are testing
this is a test
corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
corporate code 134-456 you ca also tap on this link to verify corporate.co/5667

前一个命令的输出是：

％cat output / part-000 *

5.0,6.0,0.7373482646933146
0.0,1.0,0.8164965809277261
4.0,5.0,0.053913565847778636
1.0,5.0,0.13144171271256438
2.0,4.0,0.16888723050548915
4.0,6.0,0.052731941041749664

输出中的每一行代表两行之间的相似性，如下所示： ＆＃34; lineX -1＆＃34;，＆＃34; lineY -1＆＃34;，＆＃34;相似性＆＃34;

显示最后两行之间相似性的输出是5.0,6.0,0.7373482646933146，这很好。

这两行是

corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
corporate code 134-456 you ca also tap on this link to verify corporate.co/5667

且相似度 0.7373482646933146

当文件输入为：

时

hello world
hello world 123
hello world 956248
hello world  2564
how is every thing
we are testing
this is a test
corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
corporate code 134-456 you ca also tap on this link to verify corporate.co/5667
corporate code 456-458 you ca also tap on this link to verify corporate.co/8965
corporate code 444-444 you ca also tap on this link to verify corporate.co/4444

输出是：

7.0,10.0,0.4855543123154418
2.0,3.0,0.32317021425463427
6.0,8.0,0.03657892871242232
6.0,10.0,0.03097823353416634
0.0,1.0,0.6661166307685673
7.0,8.0,0.5733398760974173
1.0,2.0,0.37867439463004254
9.0,10.0,0.4855543123154418
0.0,3.0,0.5684806190668547
8.0,9.0,0.6716256614182469
4.0,6.0,0.1903502047647684
8.0,10.0,0.4855543123154418
1.0,3.0,0.37867439463004254
6.0,9.0,0.03657892871242232
7.0,9.0,0.5733398760974173
6.0,7.0,0.03657892871242232
1.0,7.0,0.233827426275723
0.0,2.0,0.5684806190668547

第一个例子中测试的相同行之间的输出是：7.0,8.0,0.5733398760974173

同一线的相似度从0.7373482646933146降至0.5733398760974173

这两行是：

corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
corporate code 134-456 you ca also tap on this link to verify corporate.co/5667

且相似度 0.5733398760974173

是否有任何解决方案可以避免相似性降低当他们的相似行消息在输入中增加时的句子？（tfidf可能是这里的问题？当相似的句号增加相似性因tfidf而降低？）
是否有任何解决方案来集群类似的消息？

，即上面的输入，包含多个句子，如：

你好世界123

相同的句子如：

公司代码123-234您也可以点击此链接来验证corporate.co/1234

是否可以根据相似性输出进行分组？

文本句子之间的相似之处

你好世界123

公司代码123-234您也可以点击此链接来验证corporate.co/1234

0 个答案: