I'm trying to find similar documents based on their text in Spark, using Python.
So far I've set this up with RowMatrix, IndexedRowMatrix, and CoordinateMatrix, and then run columnSimilarities (DIMSUM). The problem with DIMSUM is that it's optimized for many features and few items. http://stanford.edu/~rezab/papers/dimsum.pdf
Our initial approach was to create tf-idf vectors of all words in all documents, then transpose that into a RowMatrix where we have a row for each word and a column for each item. Then we ran columnSimilarities, which gives us a CoordinateMatrix of ((item_i, item_j), similarity). This just doesn't work well when the number of columns > the number of rows.
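For reference, what that pipeline computes is the cosine similarity between every pair of tf-idf columns. A minimal local sketch of that quantity (plain NumPy on a made-up toy matrix, rather than Spark's `RowMatrix.columnSimilarities`) looks like:

```python
import numpy as np

# Toy tf-idf matrix in the transposed layout described above:
# one row per word (feature), one column per item. Values are made up.
tfidf = np.array([
    [0.9, 0.0, 0.8],
    [0.1, 0.7, 0.0],
    [0.0, 0.6, 0.5],
])

# Normalize each column to unit length; the Gram matrix of the
# normalized columns is then the item-item cosine similarity matrix,
# which is the quantity columnSimilarities (DIMSUM) approximates.
norms = np.linalg.norm(tfidf, axis=0)
normalized = tfidf / norms
similarities = normalized.T @ normalized

# similarities[i, j] is the cosine similarity of items i and j
print(np.round(similarities, 3))
```

Computing this dense N x N matrix directly is exactly what stops scaling once N (the number of items) is large, which is why DIMSUM samples instead of materializing every pair.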
We need a way to calculate all-pairs similarity with many items and few features: #items = 10^7, #features = 10^4. At a higher level, we're trying to create an item-based recommender that, given one item, will return a few quality recommendations based only on the text.
Answer 0: (score: 0)
I would have written this as a comment rather than an answer, but it wouldn't let me.
This problem can be solved "trivially" using ElasticSearch's {{3}}. From the documentation you can see how it works and which factors are taken into account, which should be useful information even if you end up implementing it in Python.
They have also implemented other interesting algorithms, such as the more-like-this query.
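To make the Elasticsearch suggestion concrete: a more_like_this query takes an example document and returns the most textually similar ones, which matches the "given one item, recommend a few" use case. A hedged sketch of such a request body (the index name "documents" and field name "text" are illustrative assumptions, not from the question) is just a Python dict you would POST to the search endpoint:

```python
# Hypothetical more_like_this request body. The index name ("documents")
# and field name ("text") are assumptions for illustration only.
mlt_query = {
    "query": {
        "more_like_this": {
            # which text fields to compare
            "fields": ["text"],
            # reference an already-indexed item by id instead of raw text
            "like": [{"_index": "documents", "_id": "item_123"}],
            # ignore very rare terms and cap query size for speed
            "min_term_freq": 1,
            "max_query_terms": 25,
        }
    }
}
```

Elasticsearch then scores candidates with its own tf-idf/BM25-style relevance, so you get "a few quality recommendations" per item without ever materializing the full 10^7 x 10^7 similarity matrix.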