Can Locality Sensitive Hashing be used on dynamic data?

Asked: 2015-09-01 15:43:00

Tags: algorithm string-matching nearest-neighbor locality-sensitive-hash

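Is it possible to use Locality Sensitive Hashing on dynamic data? For example, suppose I first run LSH on 1,000,000 documents and store the results in an index, and then I want to add another document to that index. Can I do this with LSH?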

2 answers:

Answer 0 (score: 2)

Yes.

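LSH applies several hash functions to each document to produce a signature, and the signature is then split into bands that are hashed to produce the index keys. If you store the random hash functions and the banding parameters, you can reuse them to compute the index keys for a newly inserted document. Every new insert therefore gets index entries that are consistent with the ones already stored, and nothing has to be rebuilt.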
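As a rough illustration of the point above, here is a minimal Python sketch of a MinHash index with banding that accepts inserts after the initial build. The class name `MinHashLSHIndex`, the `shingles` helper, and the parameters (64 hash functions, 16 bands, seed 42) are assumptions made for this example and do not come from the question. The idea is only that the hash parameters are fixed once and kept, so a document added later goes through exactly the same functions as the documents indexed first.

```python
import hashlib
import random
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (hypothetical helper)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


class MinHashLSHIndex:
    """Illustrative MinHash + banding index; not a production implementation."""

    PRIME = (1 << 61) - 1  # modulus for the universal hash family

    def __init__(self, num_hashes=64, bands=16, seed=42):
        if num_hashes % bands != 0:
            raise ValueError("num_hashes must be divisible by bands")
        rng = random.Random(seed)
        # The stored (a, b) pairs ARE the hash functions. Keeping them
        # fixed is what makes later inserts compatible with earlier ones.
        self.params = [
            (rng.randrange(1, self.PRIME), rng.randrange(0, self.PRIME))
            for _ in range(num_hashes)
        ]
        self.bands = bands
        self.rows = num_hashes // bands
        self.buckets = defaultdict(set)  # (band number, band values) -> doc ids
        self.signatures = {}             # doc id -> MinHash signature

    @staticmethod
    def _stable_hash(token):
        # Python's built-in hash() of str is salted per process, so use SHA-1.
        digest = hashlib.sha1(token.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def signature(self, tokens):
        hashed = [self._stable_hash(t) for t in tokens]
        return tuple(
            min((a * h + b) % self.PRIME for h in hashed)
            for a, b in self.params
        )

    def _band_keys(self, sig):
        for i in range(self.bands):
            yield (i, sig[i * self.rows:(i + 1) * self.rows])

    def insert(self, doc_id, tokens):
        """Add one document; existing entries are left untouched."""
        sig = self.signature(tokens)
        self.signatures[doc_id] = sig
        for key in self._band_keys(sig):
            self.buckets[key].add(doc_id)

    def candidates(self, tokens):
        """Return ids of documents sharing at least one band with the query."""
        found = set()
        for key in self._band_keys(self.signature(tokens)):
            found |= self.buckets.get(key, set())
        return found
```

Usage would look like `index.insert("doc-1000001", shingles(text))` at any time after the initial bulk load. To survive a restart, the seed or the `params` list would have to be persisted alongside the buckets.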

Answer 1 (score: 1)

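Yes, you can do this. You only need to compute the Jaccard similarity between the added document and each of the existing documents, and add the results to the index.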

CREATE TABLE Documents (
  ID INT IDENTITY(1,1) PRIMARY KEY NOT NULL, 
  MinHashes BINARY(512), -- serialized Min Hash results
  Name NVARCHAR(255) UNIQUE NOT NULL, 
  Content VARBINARY(MAX)
)

CREATE TABLE SimilarDocumentIndex (
  DocumentAID INT REFERENCES Documents(ID),
  DocumentBID INT REFERENCES Documents(ID),
  Similarity FLOAT, -- Jaccard Similarity 0.0...1.0
  PRIMARY KEY CLUSTERED (DocumentAID, DocumentBID)
)

--
-- Find similar documents
--
-- DocumentBID is unique per DocumentAID (primary key), so no DISTINCT is needed
SELECT TOP 20 DocumentBID
FROM SimilarDocumentIndex 
WHERE DocumentAID = @DocumentID 
ORDER BY Similarity DESC

--
-- Compare two documents
--    
SELECT Similarity 
FROM SimilarDocumentIndex 
WHERE DocumentAID = @DocumentAID AND DocumentBID = @DocumentBID

--
-- Adding a new document
--
SET @MinHashes = dbo.CalcMinHashes(@content)

INSERT INTO Documents (MinHashes, Name, Content)
VALUES (@MinHashes, @name, @content)

SET @DocumentID = SCOPE_IDENTITY()

INSERT INTO SimilarDocumentIndex
  SELECT @DocumentID, ID, dbo.JaccardSimilarity(@MinHashes, MinHashes)
  FROM Documents 
  WHERE ID <> @DocumentID 

INSERT INTO SimilarDocumentIndex
  SELECT DocumentBID, @DocumentID, Similarity
  FROM SimilarDocumentIndex
  WHERE DocumentAID = @DocumentID
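
The SQL above relies on a `dbo.JaccardSimilarity` function whose body is not shown. One common way to do this, sketched here in Python under the assumption that each signature is an equal-length sequence of MinHash values, is to take the fraction of positions at which the two signatures agree as an estimate of the Jaccard similarity. The function names `estimate_jaccard` and `exact_jaccard` are hypothetical, and the exact version is included only for comparison.

```python
def exact_jaccard(a, b):
    """Exact Jaccard similarity |A intersect B| / |A union B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity from two MinHash signatures.

    Assumes both signatures were produced with the same hash functions
    in the same order; otherwise the comparison is meaningless.
    """
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    if not sig_a:
        raise ValueError("signatures must not be empty")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Note that computing this against every stored document on each insert, as the SQL does, costs time linear in the number of documents. It works for dynamic data, but it does not use LSH buckets to narrow the comparison to likely candidates.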