Question

Given the following tables (each containing about 2 million rows):

movie(title, genre, price)
ground_truth_movie(title, genre)

movie:

|           title           |   genre   | price |
|***************************|***********|*******|
|       Bria nsau r us      |  History  | 10.2  |
|  BoJch o BillyOut6 tt e d | Animation | 10.2  |
|M on l igh t on th6  R ange|  Fantasy  | 10.2  |
|    Pret  a  Teleport er   | Animation | 10.2  |
|           ...             |    ...    |  ...  |

ground_truth_movie:

|           title           |   genre   |
|***************************|***********|
|       Briansaurus         |  History  |
|  Broncho Billy Outwitted  | Animation |
|  Moonlight on the Range   |  Fantasy  |
|     Pret a Teleporter     | Animation |
|           ...             |    ...    |

My task is to clean up the titles in the movie table using the titles in the ground_truth_movie table.

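Because the titles in the movie table are badly garbled, the best comparison function I have found for this case is levenshtein. So far I have the following SQL script: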

SELECT m.title, (
    SELECT g.title
    FROM ground_truth_movie g
    ORDER BY levenshtein(g.title, m.title) ASC, g.title
    LIMIT 1
)
FROM movie m;

Because this script takes a very long time even on a reduced dataset, I am looking for a way to speed it up. Any ideas?

Are there other approaches for comparing a column against a ground truth?
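One direction I have been considering, but have not benchmarked, is trigram matching from the pg_trgm extension: a GiST index on the ground-truth titles can serve nearest-neighbour lookups, so each dirty title would only be compared with levenshtein against a handful of candidates instead of every row. The following is only a minimal sketch under assumptions: that pg_trgm and fuzzystrmatch can be installed, and that the index name and the candidate count of 10 (both made up here) suit the data. Trigram distance is a different measure from Levenshtein distance, so the correct title could fall outside the candidate list, especially for titles with many stray spaces.

```sql
-- Assumption: both extensions can be installed on this server.
CREATE EXTENSION IF NOT EXISTS pg_trgm;        -- provides <-> and gist_trgm_ops
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;  -- provides levenshtein()

-- Hypothetical index name; GiST (not GIN) supports ORDER BY ... <-> ...
CREATE INDEX ground_truth_movie_title_trgm_idx
    ON ground_truth_movie USING gist (title gist_trgm_ops);

SELECT m.title    AS dirty_title,
       best.title AS clean_title
FROM movie m
LEFT JOIN LATERAL (
    SELECT c.title
    FROM (
        -- Step 1: fetch the 10 nearest titles by trigram distance (index-assisted).
        SELECT g.title
        FROM ground_truth_movie g
        ORDER BY g.title <-> m.title
        LIMIT 10   -- assumed candidate count, tune as needed
    ) c
    -- Step 2: re-rank only those candidates with the original levenshtein rule.
    ORDER BY levenshtein(c.title, m.title), c.title
    LIMIT 1
) best ON true;
```

Checking the plan with EXPLAIN on a small sample of movie rows should show whether the inner ORDER BY actually uses the GiST index.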

I am new to this field, and I suspect my first attempt is not the best ;)

Thanks for your help.

Postgres: comparing content against a ground-truth table

0 answers: