All-pairs similarity using tfidf vectors in pyspark

Asked: 2015-07-28 16:35:12

Tags: apache-spark machine-learning pyspark apache-spark-mllib tf-idf

I'm trying to find similar documents based on their text in Spark, using Python (PySpark).

So far I have set this up with RowMatrix, IndexedRowMatrix, and CoordinateMatrix, and then ran columnSimilarities (DIMSUM). The problem with DIMSUM is that it's optimized for many features and few items: http://stanford.edu/~rezab/papers/dimsum.pdf
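
For reference, a minimal sketch of what that looks like with the pyspark.mllib distributed-matrix API (columnSimilarities only gained a Python binding in later Spark releases; the vectors and the 0.1 threshold here are illustrative):

```python
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

sc = SparkContext(appName="dimsum-sketch")

# Rows are feature vectors; the columns are the items being compared.
rows = sc.parallelize([
    Vectors.dense([1.0, 0.0, 2.0]),
    Vectors.dense([0.0, 3.0, 1.0]),
    Vectors.dense([4.0, 0.0, 0.0]),
])
mat = RowMatrix(rows)

# With a threshold, DIMSUM's sampling kicks in; pairs whose estimated
# cosine similarity falls below 0.1 may be dropped.
sims = mat.columnSimilarities(0.1)  # CoordinateMatrix of (i, j, cosine)
for entry in sims.entries.collect():
    print(entry.i, entry.j, entry.value)
```

Passing a threshold trades a little accuracy for much less shuffle, which is the point of the DIMSUM sampling scheme.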

Our initial approach was to create tf-idf vectors for all words in all documents, then transpose that into a RowMatrix with a row for each word and a column for each item. We then ran columnSimilarities, which gives a CoordinateMatrix of ((item_i, item_j), similarity). This just doesn't work well when the number of columns is greater than the number of rows.
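
As a hedged sketch of that pipeline under the same API, assuming tokenized documents and a made-up hashing dimension, the tf-idf vectors can be transposed through a CoordinateMatrix so words become rows and items become columns:

```python
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

sc = SparkContext(appName="tfidf-transpose-sketch")

# Toy corpus: one token list per item (the real input would be 10^7 items).
documents = sc.parallelize([
    ["spark", "similarity", "text"],
    ["spark", "recommender"],
    ["text", "recommender", "items"],
])

tf = HashingTF(numFeatures=1 << 14).transform(documents)
tf.cache()  # IDF.fit makes an extra pass over the data
tfidf = IDF().fit(tf).transform(tf)  # one sparse tf-idf vector per item

# Transpose by emitting (word_index, item_index, weight) entries, so the
# resulting matrix has a row per word and a column per item.
entries = tfidf.zipWithIndex().flatMap(
    lambda vi: [MatrixEntry(int(w), vi[1], float(x))
                for w, x in zip(vi[0].indices, vi[0].values)])

transposed = CoordinateMatrix(entries).toRowMatrix()
item_sims = transposed.columnSimilarities()  # ((item_i, item_j), cosine)
```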

We need a way to calculate all-pairs similarity with many items and few features: #items=10^7, #features=10^4. At a higher level, we're trying to build an item-based recommender that, given one item, returns a few quality recommendations based only on the text.
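
One illustrative way to turn the all-pairs output into that recommender, assuming item_sims is the CoordinateMatrix produced by a columnSimilarities call like the one above, is to symmetrize the pairs and keep the best few neighbors per item:

```python
# columnSimilarities only emits the upper triangle (i < j),
# so emit both directions of each pair before grouping.
pairs = item_sims.entries.flatMap(
    lambda e: [(e.i, (e.j, e.value)), (e.j, (e.i, e.value))])

# Keep the 5 most similar neighbors per item.
top_neighbors = pairs.groupByKey().mapValues(
    lambda neighbors: sorted(neighbors, key=lambda t: -t[1])[:5])
```

At 10^7 items, a bounded aggregation such as aggregateByKey would be safer than groupByKey, which materializes every neighbor list.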

1 Answer:

Answer 0 (score: 0)

I would have written this as a comment rather than an answer, but it won't let me.

Leveraging ElasticSearch's {{3}}, this can be solved "trivially". From the documentation you can see how it works and which factors are taken into account, which should be useful information even if you end up implementing it in Python.

They have also implemented other interesting algorithms, such as the more-like-this query.
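
For illustration, a sketch of a more_like_this query through the official elasticsearch Python client; the index name, field name, document id, and tuning parameters are all assumptions, and the like-by-document syntax comes from Elasticsearch releases newer than this answer:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local cluster

# Hypothetical index "items" with each document's text in a "text" field.
resp = es.search(index="items", body={
    "query": {
        "more_like_this": {
            "fields": ["text"],
            "like": [{"_index": "items", "_id": "some-item-id"}],
            "min_term_freq": 1,
            "max_query_terms": 25,
        }
    },
    "size": 5,  # a few quality recommendations, as asked for
})
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```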