How can I find nearly identical records in SQL?

Time: 2017-12-18 22:02:37

Tags: sql algorithm postgresql similarity

This is the record I am searching for:

A = {
    field1: value1,
    field2: value2,
    ...
    fieldN: valueN
}

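I have many records like this in my database.

Another record (B) almost matches record A if at least N-M fields of the two records are equal. Here is an example with M = 2: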

B = {
    field1: OTHER_value1,
    field2: OTHER_value2,
    field3: value3,
    ...
    fieldN: valueN
}

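The differing fields can be any of the fields, not just the first ones.

I could write one very large combinatorial SQL query, but there may be a more elegant solution.

P.S.: My database is PostgreSQL.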

2 answers:

Answer 0: (score: 3)

A search condition like this will not be able to use any indexes, but it can be done...

SELECT
  *
FROM
  yourTable
WHERE
  N-M <= CASE WHEN yourTable.field1 = searchValue1 THEN 1 ELSE 0 END
       + CASE WHEN yourTable.field2 = searchValue2 THEN 1 ELSE 0 END
       + CASE WHEN yourTable.field3 = searchValue3 THEN 1 ELSE 0 END
       ...
       + CASE WHEN yourTable.fieldN = searchValueN THEN 1 ELSE 0 END

Similarly, if your search criteria are in another table...

SELECT
  *
FROM
  yourTable
INNER JOIN
  search
    ON N-M <= CASE WHEN yourTable.field1 = search.field1 THEN 1 ELSE 0 END
            + CASE WHEN yourTable.field2 = search.field2 THEN 1 ELSE 0 END
            + CASE WHEN yourTable.field3 = search.field3 THEN 1 ELSE 0 END
            ...
            + CASE WHEN yourTable.fieldN = search.fieldN THEN 1 ELSE 0 END

(You will need to fill in the value of N-M yourself.)
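To make the counting condition concrete, here is a minimal runnable sketch. It is not part of the original answer: it assumes a hypothetical four-field table (N = 4) with M = 1, and it runs the query through Python's sqlite3 module as a stand-in for PostgreSQL. The CASE expressions themselves are the same in both databases.

```python
import sqlite3

# Assumption: SQLite in memory stands in for PostgreSQL; the table layout
# and the sample rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable (id INTEGER PRIMARY KEY, field1, field2, field3, field4)"
)
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a", "b", "c", "d"),  # 4 of 4 fields match the search record
        (2, "x", "b", "c", "d"),  # 3 of 4 match
        (3, "x", "y", "c", "d"),  # 2 of 4 match
        (4, "x", "y", "z", "w"),  # 0 of 4 match
    ],
)

N, M = 4, 1                      # at least N - M = 3 fields must be equal
search = ("a", "b", "c", "d")    # the search record A

query = """
SELECT id
FROM yourTable
WHERE ? <= CASE WHEN field1 = ? THEN 1 ELSE 0 END
         + CASE WHEN field2 = ? THEN 1 ELSE 0 END
         + CASE WHEN field3 = ? THEN 1 ELSE 0 END
         + CASE WHEN field4 = ? THEN 1 ELSE 0 END
ORDER BY id
"""
near_matches = [row[0] for row in conn.execute(query, (N - M, *search))]
print(near_matches)  # [1, 2]
```

Rows 1 and 2 qualify because they agree with the search record on at least three fields. Every row is still evaluated, which is why this form cannot benefit from an index.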

Edit:

A more long-winded approach that can make some use of indexes...

SELECT
    id,  -- your table would need to have a primary key / identity column
    MAX(field1)   AS field1,
    MAX(field2)   AS field2,
    MAX(field3)   AS field3,
    ...
    MAX(fieldN)   AS fieldN
FROM
(
    SELECT * FROM yourTable WHERE field1 = searchValue1
    UNION ALL
    SELECT * FROM yourTable WHERE field2 = searchValue2
    UNION ALL
    SELECT * FROM yourTable WHERE field3 = searchValue3
    UNION ALL
    ...
    UNION ALL
    SELECT * FROM yourTable WHERE fieldN = searchValueN
)
    AS unioned_seeks
GROUP BY
    id
HAVING
    COUNT(*) >= N-M

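If you have an index on each field, and you expect a relatively small number of matches for each field, this may outperform the first option, at the expense of very repetitive code.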
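A small sketch of the UNION ALL variant, again under assumptions that are not in the original: a hypothetical three-field table, one index per field, M = 1, and SQLite via Python's sqlite3 module standing in for PostgreSQL. For brevity the inner queries select only id and the outer query reports the match count instead of the MAX(...) columns.

```python
import sqlite3

# Assumption: in-memory SQLite as a stand-in; table, index names, and data
# are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (id INTEGER PRIMARY KEY, field1, field2, field3)")
for col in ("field1", "field2", "field3"):
    # One index per field, so each branch of the UNION ALL can be an index seek.
    conn.execute(f"CREATE INDEX idx_{col} ON yourTable ({col})")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?)",
    [
        (1, "a", "b", "c"),  # matches on 3 fields
        (2, "a", "b", "z"),  # matches on 2 fields
        (3, "a", "y", "z"),  # matches on 1 field
        (4, "x", "y", "z"),  # matches on 0 fields
    ],
)

N, M = 3, 1
query = """
SELECT id, COUNT(*) AS matched
FROM (
    SELECT id FROM yourTable WHERE field1 = ?
    UNION ALL
    SELECT id FROM yourTable WHERE field2 = ?
    UNION ALL
    SELECT id FROM yourTable WHERE field3 = ?
) AS unioned_seeks
GROUP BY id
HAVING COUNT(*) >= ?
ORDER BY id
"""
rows = conn.execute(query, ("a", "b", "c", N - M)).fetchall()
print(rows)  # [(1, 3), (2, 2)]
```

Each row appears in the union once per matching field, so COUNT(*) per id is the number of equal fields. Whether the planner actually uses the indexes depends on the data and the database, so it is worth checking the query plan on real data.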

Answer 1: (score: 3)

I would use IS NOT DISTINCT FROM so that NULL values are handled correctly.

You can also simplify the logic using Postgres shorthand. One way is:

where ( (a.field1 is not distinct from b.field1)::int +
        (a.field2 is not distinct from b.field2)::int +
        . . .
        (a.fieldn is not distinct from b.fieldn)::int
      ) >= N - M
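To show why the NULL-safe comparison matters, here is a small sketch that is not from the original answer. It uses SQLite through Python's sqlite3 module, where the IS operator plays the role that is not distinct from plays in PostgreSQL, and where a comparison already yields 0 or 1, so no ::int cast is needed. The two records are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical records that agree on all three fields; field2 is NULL in both.
a = ("x", None, "z")
b = ("x", None, "z")
params = [value for pair in zip(a, b) for value in pair]  # a1, b1, a2, b2, a3, b3

# Plain "=": NULL = NULL evaluates to NULL, so that CASE falls through to 0.
eq_count = conn.execute(
    """SELECT CASE WHEN ? = ? THEN 1 ELSE 0 END
            + CASE WHEN ? = ? THEN 1 ELSE 0 END
            + CASE WHEN ? = ? THEN 1 ELSE 0 END""",
    params,
).fetchone()[0]

# NULL-safe comparison: two NULLs count as equal.
null_safe_count = conn.execute(
    "SELECT (? IS ?) + (? IS ?) + (? IS ?)", params
).fetchone()[0]

print(eq_count, null_safe_count)  # 2 3
```

With plain equality the pair looks like it differs in one field, while the NULL-safe count reports all three fields as equal.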

I think this is easier to express using only M. So, just count the fields that differ:

where ( (a.field1 is distinct from b.field1)::int +
        (a.field2 is distinct from b.field2)::int +
        . . .
        (a.fieldn is distinct from b.fieldn)::int
      ) <= M

Doing this across your data requires a cross join, which is very expensive.
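As a rough sketch of what that pairwise comparison looks like, the example below finds every pair of records that differs in at most M fields. The assumptions are mine, not the answer's: a hypothetical three-field table named records, M = 1, and SQLite via Python's sqlite3 module, where IS NOT stands in for PostgreSQL's is distinct from. The a.id < b.id condition only avoids self-pairs and mirrored duplicates; the number of comparisons still grows quadratically with the table size.

```python
import sqlite3

# Assumption: in-memory SQLite as a stand-in; table name and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, field1, field2, field3)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        (1, "a", "b", "c"),
        (2, "a", "b", None),
        (3, "a", "q", None),
        (4, "x", "y", "z"),
    ],
)

M = 1  # allow at most one differing field

query = """
SELECT a.id, b.id
FROM records AS a
JOIN records AS b ON a.id < b.id
WHERE (a.field1 IS NOT b.field1)
    + (a.field2 IS NOT b.field2)
    + (a.field3 IS NOT b.field3) <= ?
ORDER BY a.id, b.id
"""
pairs = conn.execute(query, (M,)).fetchall()
print(pairs)  # [(1, 2), (2, 3)]
```

Records 2 and 3 pair up because their NULL field3 values count as equal under the NULL-safe comparison, leaving field2 as the only difference.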