Question

I cannot disclose the actual data, so I am just giving an example. I have two tables. One is a dictionary table that holds the IDs for titles. The second table is new data coming into the database, and it has no IDs. I need to fill in the IDs for the new data by checking the dictionary table: if something similar already exists there, use its ID; otherwise, add the new value to the dictionary, get a new ID for it, and set that same ID on the new data. I want the ID column in the second table to be updated as follows.

Expected ID

Answer 1

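This looks like a job for pg_trgm. It gives you the % operator, which returns true if two strings are "close enough". You can adjust what "close enough" means by changing pg_trgm.similarity_threshold. You can create an index to speed this up. The higher you set similarity_threshold, the more speedup you are likely to get.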
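
As a rough sketch of the setup described above (the index name is hypothetical, and the table and column names foo1 and title are borrowed from the queries further down), enabling the extension, setting the threshold, and creating a trigram index might look like this:

```sql
-- Make the % and <-> operators available in this database.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Threshold used by the % operator for this session (0.3 is the default).
SET pg_trgm.similarity_threshold = 0.3;

-- Hypothetical index name. A GiST trigram index can serve both the
-- % filter and ORDER BY ... <-> ordering on the dictionary titles.
CREATE INDEX foo1_title_trgm_idx ON foo1 USING gist (title gist_trgm_ops);
```

A GIN index with gin_trgm_ops is an alternative for the % filter, but it does not support ordering by <->.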

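If I take the table you showed first as foo1, and all the distinct titles from your desired output as foo2, then this query gives reasonable results: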

select *, foo1.title <-> foo2.title as distance from foo2 left join foo1 on foo1.title % foo2.title;

                title                 |   id   |       title       | distance  
--------------------------------------+--------+-------------------+-----------
 The Hunger Games: Mockingjay Part I  |      2 | The Hunger Games  | 0.5142857
 The Hunger Games: Mockingjay Part II |      2 | The Hunger Games  | 0.5277778
 John Wick (2014)                     |      3 | John Wick         | 0.3333333
 John Wick Chapter 2                  |      3 | John Wick         |       0.5
 Alien                                |      1 | Aliens            |     0.375
 Alien                                |      4 | Alien vs Predator | 0.6666666
 Aliens                               |      1 | Aliens            |         0
 Alien 3                              |      1 | Aliens            |       0.5
 Alien 3                              |      4 | Alien vs Predator |       0.7
 Lord of the Rings                    | (null) | (null)            |    (null)
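
To read the distance column: <-> is one minus the value returned by pg_trgm's similarity() function, so smaller numbers mean closer matches. A quick check on one pair from the output above, using only literals:

```sql
-- similarity() is the trigram similarity; <-> is 1 minus that value.
SELECT similarity('Alien', 'Aliens') AS similarity,  -- 0.625
       'Alien' <-> 'Aliens'          AS distance;    -- 0.375
```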

If you want only one output row per foo2 row, showing just the "best" match, you can use LEFT JOIN LATERAL:

select *, a.title <-> foo2.title as distance from foo2 left join lateral 
    (select * from foo1 where foo1.title % foo2.title order by foo1.title <-> foo2.title limit 1) a
    on true;
                title                 |   id   |      title       | distance  
--------------------------------------+--------+------------------+-----------
 The Hunger Games: Mockingjay Part I  |      2 | The Hunger Games | 0.5142857
 The Hunger Games: Mockingjay Part II |      2 | The Hunger Games | 0.5277778
 John Wick (2014)                     |      3 | John Wick        | 0.3333333
 John Wick Chapter 2                  |      3 | John Wick        |       0.5
 Alien                                |      1 | Aliens           |     0.375
 Aliens                               |      1 | Aliens           |         0
 Alien 3                              |      1 | Aliens           |       0.5
 Lord of the Rings                    | (null) | (null)           |    (null)

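How to replace the NULLs in the "id" column (the rows with no close-enough match) with newly generated IDs is a separate issue, which you should ask as a separate question.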
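
Purely as an illustrative sketch, and not part of the original answer, one possible shape for that follow-up step is shown here. It assumes that foo1.id is generated automatically (for example a serial or identity column) and that foo2 has a nullable id column; neither assumption is confirmed by the question.

```sql
-- Assumed schema: foo1(id, title) with auto-generated id; foo2(title, id).

-- 1. Copy the id of the best close-enough dictionary match, if any.
UPDATE foo2
SET id = (
    SELECT foo1.id
    FROM foo1
    WHERE foo1.title % foo2.title
    ORDER BY foo1.title <-> foo2.title
    LIMIT 1
)
WHERE id IS NULL;

-- 2. Add the titles that are still unmatched to the dictionary.
INSERT INTO foo1 (title)
SELECT DISTINCT title
FROM foo2
WHERE id IS NULL;

-- 3. Pick up the newly generated ids by exact title match.
UPDATE foo2
SET id = foo1.id
FROM foo1
WHERE foo2.id IS NULL
  AND foo1.title = foo2.title;
```

Note that step 2 gives every unmatched title its own new ID, even when two new titles are similar to each other, so this sketch would still need the review described below.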

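For any realistically sized data set, you can't just blindly accept whatever results the above query produces, at least not if you want high-quality results. Instead, you can have the computer generate suggestions like those above and then present them to a human (in a convenient interface) to approve, reject, or investigate further.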
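
One way to stage such suggestions for review might be to materialize the best-match query into a table with a decision column. The table name title_match_review and the approved column are hypothetical, introduced only for this sketch:

```sql
-- Hypothetical review table: one suggestion per new title,
-- with approved left NULL until a person has decided.
CREATE TABLE title_match_review AS
SELECT foo2.title            AS new_title,
       a.id                  AS suggested_id,
       a.title               AS suggested_title,
       a.title <-> foo2.title AS distance,
       NULL::boolean         AS approved
FROM foo2
LEFT JOIN LATERAL (
    SELECT *
    FROM foo1
    WHERE foo1.title % foo2.title
    ORDER BY foo1.title <-> foo2.title
    LIMIT 1
) a ON true;

-- A reviewer (or a UI acting on their behalf) then records decisions, e.g.:
UPDATE title_match_review
SET approved = true
WHERE new_title = 'John Wick (2014)';
```

Sorting the review table by distance puts the most doubtful suggestions last, which may help prioritize the manual work.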

Check for similarity in strings with Postgres
