Question

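I am using PostgreSQL 9.2.

My task is to find similar names in a table, where "similar" means within some Levenshtein distance.

For example, with a distance of 3 and a table containing this data: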

|           name            |
|***************************|
|       Marcus Miller       |
|       Marcos Miller       |
|       Macus Miler         |
|       David Bowie         |
|       Dave Grohl          |
|       Dav Grol            |
|           ...             |

I would like to get a result like this:

|       Marcus Miller, Marcos Miller, Macus Miler       |
|       Dave Grohl, Dav Grol                            |
|           ...                                         |

or

|       Marcus Miller, Marcos Miller                    | 
|       Marcus Miller, Macus Miler                      |
|       Dave Grohl, Dav Grol                            |
|           ...                                         |

I tried:

SELECT a.name, b.name
FROM my_table a
JOIN my_table b ON b.id < a.id AND levenshtein(b.name, a.name) < 3;

but it is too slow for my data.

Answer 1

Your question contains a major conceptual error: GROUP BY takes an equivalence relation (in the mathematical sense) as its argument and uses it to partition a SQL relation into equivalence classes.

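The problem is that the relation you describe, "two strings are within some edit distance of each other", is not an equivalence relation. It is symmetric and reflexive, but not transitive. To illustrate: suppose I added to your data set a series of names that gradually morphs "Marcus Miller" into "Dave Grohl", where each name in the series is within the edit distance of the one before it. What should the answer be then?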
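The lack of transitivity can be seen on a much shorter chain. The sketch below assumes the fuzzystrmatch extension (which provides levenshtein()) is available; the three names are made up for illustration and are not from the asker's table. The first two distances are 2 and the third is 4, so with the question's condition (distance < 3) the first name matches the second and the second matches the third, but the first does not match the third.

```sql
-- Assumption: the fuzzystrmatch contrib module can be installed in this database.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Hypothetical three-name chain: neighbours are 2 edits apart,
-- while the two ends are 4 edits apart.
SELECT levenshtein('Marcus Miller', 'Marcos Miler') AS first_to_second,
       levenshtein('Marcos Miler',  'Markos Myler') AS second_to_third,
       levenshtein('Marcus Miller', 'Markos Myler') AS first_to_third;
```

Any grouping that puts all three names in one row therefore contains a pair that fails the distance condition, and any grouping that separates them splits a pair that satisfies it.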

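There are, however, algorithms that partition a data set using things that are not equivalence relations, such as geometric distance. K-means clustering is one of the best-known examples. Perhaps there is a way to adapt k-means or something similar to this problem; I don't know.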
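Short of a clustering approach, one option that avoids the partition problem is to list each name together with its own neighbours, which is close to the second output form in the question. This is only a sketch, assuming the asker's my_table(id, name) and the fuzzystrmatch extension; it is still a self-join, so it may remain slow on large tables.

```sql
-- Assumption: my_table(id, name) as in the question; fuzzystrmatch is installed.
SELECT a.name,
       string_agg(b.name, ', ' ORDER BY b.name) AS similar_names
FROM my_table a
JOIN my_table b
  ON b.id <> a.id
  -- Cheap prefilter: names whose lengths differ by 3 or more
  -- cannot be within distance 2 of each other.
 AND abs(length(a.name) - length(b.name)) < 3
  -- levenshtein_less_equal can stop early once the distance exceeds the bound;
  -- "<= 2" matches the question's "< 3".
 AND levenshtein_less_equal(b.name, a.name, 2) <= 2
GROUP BY a.id, a.name;
```

Here GROUP BY is applied to a.id, which is a genuine equivalence (equality), so the result is well defined even though a name can appear in several rows.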
