I have been implementing the TF-IDF approach in Python/PySpark using mllib, as described here:
https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html
I have a training set of 150 text documents and a test set of 80 text documents. I have produced hashed TF-IDF RDDs (of SparseVectors) for both training and test, i.e. bag-of-words representations called tfidf_train and tfidf_test. The IDF is shared by both and is based on the training data only. My question is about how to work with these sparse RDDs; there is very little information out there on this.
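For context, this is roughly how I produced the two RDDs, following the linked guide (a minimal sketch; train_tokens and test_tokens are placeholder names for my tokenized document RDDs, and the default HashingTF dimensionality of 1048576 matches the vectors shown below):

    from pyspark.mllib.feature import HashingTF, IDF

    # train_tokens / test_tokens: RDDs of token lists, one list per document
    hashingTF = HashingTF()                    # default 2^20 = 1048576 features
    tf_train = hashingTF.transform(train_tokens)
    tf_test = hashingTF.transform(test_tokens)

    tf_train.cache()
    idf = IDF().fit(tf_train)                  # IDF fitted on the training data only
    tfidf_train = idf.transform(tf_train)      # 150 SparseVectors
    tfidf_test = idf.transform(tf_test)        # 80 SparseVectors, reusing the same IDF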
I now want to efficiently map each of the 80 test-document TF-IDF vectors to the training TF-IDF vector with which it shares the highest cosine similarity. Running tfidf_test.first(), I can see that each of the sparse TF-IDF vectors (making up the RDDs) looks like this:
SparseVector(1048576,{0:15.2313,9377:8.6483,16538:4.3241,45005:4.3241,67046:5.0173,80280:4.3241,83104:2.9378,83107:3.0714,87638:3.9187,90331:3.9187,110522:1.7592,138394:3.631,140318:4.3241,147576:4.3241,165673:4.3241,172912:3.9187,179664:4.3241,179767:5.0173,189356:1.047,190616:4.3241,192712:4.3241,193790:3.4078,220545:3.9187,221050:3.4078,229110:3.4078,232286:2.0728,240477:3.631,241582:4.3241,242620:3.9187,245388:5.0173,252569:2.8201,255985:5.0173,266130:4.3241,277170:3.9187,277863:4.3241,298406:4.3241,323505:4.3241,326993:3.2255,330297:4.3241,334392:3.4078,354917:3.631,355604:3.9187,365855:4.3241,383386:2.9378,386534:4.3241,387896:3.2255,392015:4.3241,395372:1.4619,406995:3.4078,414351:5.0173,433323:4.3241,434512:4.3241,438171:4.3241,439468:4.3241,453414:3.9187,454316:4.3241,456931:3.9187,461229:3.631,488050:5.0173,506649:4.3241,508845:3.0714,512698:4.3241,526484:8.6483,548929:2.8201,549530:4.3241,550044:3.631,555900:4.3241,557206:6.451,570917:1.8392,618498:3.4078,623040:3.5968,637333:4.3241,645028:2.9378,669449:3.0714,676506:4.3241,699388:4.3241,702049:2.3782,715677:3.4078,733071:3.9187,738831:3.631,743497:8.6483,782907:1.047,793071:4.3241,801052:4.3241,805189:3.2255,811506:4.3241,812013:4.3241,819994:4.3241,837270:4.3241,848755:3.9187,852042:4.3241,866553:4.3241,872996:3.2255,908183:5.0173,914226:8.6483,921216:4.3241,925934:4.3241,927892:4.3241,935542:5.0173,941563:1.0855,958430:3.4078,959994:1.7984,977239:3.9187,978895:3.0714,1001818:3.2255,1002343:3.2255,1016145:4.3241,1017725:4.3241,1031685:8.1441})
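For a single pair of such vectors, the cosine similarity itself seems straightforward (a minimal sketch, assuming u and v are two of these SparseVectors and that SparseVector.dot accepts another SparseVector, which it appears to):

    import numpy as np

    def cosine_similarity(u, v):
        # u, v: pyspark.mllib.linalg.SparseVector
        # norms are computed from the stored non-zero values only
        denom = np.linalg.norm(u.values) * np.linalg.norm(v.values)
        return float(u.dot(v)) / denom if denom else 0.0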
What I am unsure of is how to compare the two RDDs against each other, though I suspect reduceByKey(lambda x, y: x * y) could be useful. Does anyone have ideas on how to scan through each test vector and output a tuple of (best-matching training vector, cosine similarity value)? The rough approach I have been considering is sketched below.
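This is only a sketch of what I have in mind, not something I know to be idiomatic: index both RDDs, take the cartesian product of the 80 test vectors with the 150 training vectors, compute the cosine similarity of each pair with the helper above, and keep the best training match per test document via reduceByKey. The names (train_indexed, best_matches, etc.) are just placeholders of mine:

    # give every document an id so the best match per test document can be selected
    train_indexed = tfidf_train.zipWithIndex().map(lambda x: (x[1], x[0]))   # (train_id, vector)
    test_indexed = tfidf_test.zipWithIndex().map(lambda x: (x[1], x[0]))     # (test_id, vector)

    # all (test, train) combinations: 80 * 150 = 12000 pairs
    pairs = test_indexed.cartesian(train_indexed)

    # map to (test_id, (train_id, similarity)), then keep the highest-similarity pair per test_id
    best_matches = (pairs
        .map(lambda p: (p[0][0], (p[1][0], cosine_similarity(p[0][1], p[1][1]))))
        .reduceByKey(lambda a, b: a if a[1] >= b[1] else b))

    # best_matches.collect() -> [(test_id, (train_id, cosine_sim)), ...]

Is cartesian a reasonable tool here, or is there a cheaper way to do this for larger collections (e.g. normalizing the vectors up front)?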
Any help is appreciated!