So, before attempting any NLP on the corpus, I'm trying to find and remove all duplicates from a set of articles. I'm processing the data with Spark 2.4 and the LSH implementation that ships with Spark ML.
I tried the technique suggested by the LSH authors and ran a self approxSimilarityJoin on the dataset. It works well, except that, somewhat ironically, I'm left with duplicates, e.g.
+--------------------+--------------------+--------------------+
| datasetA| datasetB| JaccardDistance|
+--------------------+--------------------+--------------------+
|[7, Donald Trump ...|[3, Donald Trump ...| 0.0|
|[3, Donald Trump ...|[7, Donald Trump ...| 0.0|
|[3, Donald Trump ...|[4, Trump appear...|0.006849315068493178|
|[4, Trump appear...|[7, Donald Trump ...|0.006849315068493178|
|[4, Trump appear...|[3, Donald Trump ...|0.006849315068493178|
|[7, Donald Trump ...|[4, Trump appear...|0.006849315068493178|
|[5, The Priminist...|[6, Theresa May h...|0.011627906976744207|
|[6, Theresa May h...|[5, The Priminist...|0.011627906976744207|
|[1, Theresa May h...|[6, Theresa May h...|0.023255813953488413|
|[6, Theresa May h...|[1, Theresa May h...|0.023255813953488413|
|[5, The Priminist...|[1, Theresa May h...| 0.03448275862068961|
|[1, Theresa May h...|[5, The Priminist...| 0.03448275862068961|
+--------------------+--------------------+--------------------+
(Note: the first number is the article ID, the rest is the text.)
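For context, here is a minimal sketch of the pipeline I'm running (the column names, the HashingTF step, and the 0.6 distance threshold are assumptions for illustration; the real code is longer):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, MinHashLSH

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the article corpus: (id, text).
df = spark.createDataFrame(
    [(1, "Theresa May has ..."), (3, "Donald Trump ..."), (7, "Donald Trump ...")],
    ["id", "text"],
)

# Tokenize, hash the tokens into sparse count vectors, then MinHash them
# so that hash-bucket collisions approximate Jaccard similarity.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    HashingTF(inputCol="tokens", outputCol="features"),
    MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5),
])
model = pipeline.fit(df)
hashed = model.transform(df)

# Self-join: all pairs of articles within Jaccard distance 0.6.
lsh = model.stages[-1]
pairs = lsh.approxSimilarityJoin(hashed, hashed, 0.6, distCol="JaccardDistance")
pairs.show()
```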
That's fine, except that I now need to somehow keep only one copy of each set of duplicates.
Any ideas?
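The best I've come up with so far is the sketch below: every pair shows up twice with the ids swapped, so I keep only one orientation and anti-join the "losing" ids away (again assuming the id column and names from the sketch above):

```python
from pyspark.sql import functions as F

# Each near-duplicate pair appears twice, as (a, b) and (b, a).
# Keeping only the orientation with datasetA.id < datasetB.id leaves
# one row per pair; the right-hand ids are the copies to drop.
dupe_ids = (
    pairs
    .filter(F.col("datasetA.id") < F.col("datasetB.id"))
    .select(F.col("datasetB.id").alias("dupe_id"))
    .distinct()
)

# Anti-join: keep only articles never flagged as the right-hand member,
# i.e. the lowest-id representative of each duplicate group.
deduped = hashed.join(dupe_ids, hashed.id == dupe_ids.dupe_id, "left_anti")
```

This feels fragile, though: given pairs (1, 2) and (2, 3) but not (1, 3), id 3 gets dropped even though it was never matched against 1. Is there anything more principled, short of running connected components over the pair graph?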
BTW, does anyone know of a good tokenizer that works for most European languages? The one in Spark ML is very basic, i.e. “Some quote” becomes ['“Some', 'quote”'] :(
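A small repro of what I mean, together with the RegexTokenizer workaround I've been experimenting with (the \p{L}+ pattern is just my own guess at something Unicode-aware, not an official recommendation):

```python
from pyspark.ml.feature import RegexTokenizer, Tokenizer

sample = spark.createDataFrame([(1, "“Some quote” in naïve French")], ["id", "text"])

# The stock Tokenizer lowercases and splits on whitespace, so the curly
# quotes stay glued to the tokens: [“some, quote”, in, naïve, french]
Tokenizer(inputCol="text", outputCol="words").transform(sample).show(truncate=False)

# RegexTokenizer with gaps=False extracts runs of Unicode letters instead,
# which at least strips the punctuation: [some, quote, in, naïve, french]
RegexTokenizer(
    inputCol="text", outputCol="words",
    pattern=r"\p{L}+", gaps=False,
).transform(sample).show(truncate=False)
```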