Deduping in Spark (Scala) with LSH using a self approxSimilarityJoin

Asked: 2019-05-20 05:46:54

Tags: scala apache-spark nlp lsh

So before attempting NLP on a corpus, I'm trying to find and remove all duplicates from a set of articles. I'm using Spark 2.4 to process the data, along with the LSH implementation that ships with SparkML.

I tried the technique suggested by the LSH authors, running a self approxSimilarityJoin on the dataset, and it works well, except that, ironically, I'm left with duplicates, e.g.

+--------------------+--------------------+--------------------+                
|            datasetA|            datasetB|     JaccardDistance|
+--------------------+--------------------+--------------------+
|[7, Donald Trump ...|[3, Donald Trump ...|                 0.0|
|[3, Donald Trump ...|[7, Donald Trump ...|                 0.0|
|[3, Donald Trump ...|[4,  Trump appear...|0.006849315068493178|
|[4,  Trump appear...|[7, Donald Trump ...|0.006849315068493178|
|[4,  Trump appear...|[3, Donald Trump ...|0.006849315068493178|
|[7, Donald Trump ...|[4,  Trump appear...|0.006849315068493178|
|[5, The Priminist...|[6, Theresa May h...|0.011627906976744207|
|[6, Theresa May h...|[5, The Priminist...|0.011627906976744207|
|[1, Theresa May h...|[6, Theresa May h...|0.023255813953488413|
|[6, Theresa May h...|[1, Theresa May h...|0.023255813953488413|
|[5, The Priminist...|[1, Theresa May h...| 0.03448275862068961|
|[1, Theresa May h...|[5, The Priminist...| 0.03448275862068961|
+--------------------+--------------------+--------------------+

(Note: the first number is the ID, the rest is the text.)
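For context, that output comes from roughly the following pipeline (a minimal sketch, not my exact code: the toy rows, numHashTables=5, and the 0.05 threshold are placeholder choices):

    import org.apache.spark.ml.feature.{CountVectorizer, MinHashLSH, Tokenizer}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("dedupe").getOrCreate()
    import spark.implicits._

    // Placeholder articles standing in for the toy corpus: (id, text)
    val articles = Seq(
      (1, "Theresa May has ..."),
      (3, "Donald Trump ..."),
      (7, "Donald Trump ...")
    ).toDF("id", "text")

    // Tokenize, then build binary term vectors so MinHash sees set membership
    val tokens = new Tokenizer()
      .setInputCol("text").setOutputCol("tokens")
      .transform(articles)
    val vectors = new CountVectorizer()
      .setInputCol("tokens").setOutputCol("features")
      .setBinary(true)
      .fit(tokens).transform(tokens)

    // Fit MinHash LSH and join the dataset against itself
    val model = new MinHashLSH()
      .setNumHashTables(5)
      .setInputCol("features").setOutputCol("hashes")
      .fit(vectors)

    val pairs = model
      .approxSimilarityJoin(vectors, vectors, 0.05, "JaccardDistance")
      .filter("datasetA.id != datasetB.id") // drop each row matching itself
    pairs.show()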

That's fine, except I need to somehow keep only one copy of each set of duplicates.
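To make the goal concrete, the closest I've come is keeping one ordering of each symmetric pair and anti-joining the flagged IDs back out (a sketch building on the pairs frame above; note this min-id heuristic can over-drop on chains of near-duplicates, where exact grouping would need something like connected components):

    import org.apache.spark.sql.functions.col

    // Each near-duplicate pair shows up twice, as (A, B) and (B, A).
    // Keep only one ordering, then flag the higher id as the copy to drop.
    val toDrop = pairs
      .filter(col("datasetA.id") < col("datasetB.id"))
      .select(col("datasetB.id").as("id"))
      .distinct()

    // Anti-join keeps exactly one representative per duplicate group
    val deduped = articles.join(toDrop, Seq("id"), "left_anti")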

Any ideas?

Toy corpus I used

BTW, does anyone know of a good tokenizer that works for most European languages? SparkML's is very basic, i.e. “Some quote” becomes ['“Some', 'quote”'] :(
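For illustration, here's the behavior, plus the closest workaround I've found inside SparkML itself (RegexTokenizer splitting on non-word characters, which is still not a real multilingual tokenizer); this reuses the spark session and implicits from the sketch above:

    import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}

    val quoted = Seq((0, "\u201cSome quote\u201d")).toDF("id", "text")

    // The default Tokenizer just lowercases and splits on whitespace,
    // so the curly quotes stay glued to the words: [“some, quote”]
    new Tokenizer().setInputCol("text").setOutputCol("tokens")
      .transform(quoted).select("tokens").show(false)

    // RegexTokenizer splitting on runs of non-word characters strips
    // the punctuation instead: [some, quote]
    new RegexTokenizer().setInputCol("text").setOutputCol("tokens")
      .setPattern("\\W+")
      .transform(quoted).select("tokens").show(false)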

0 Answers:

No answers yet.