Question

When I use the fuzzystrmatch levenshtein function with diacritic characters, it returns a wrong, multibyte-ignorant result:

select levenshtein('ą', 'x');
levenshtein 
-------------
       2

(Note: the first character is an 'a' with a diacritic below it; it did not render properly after I copied it here.)

The fuzzystrmatch documentation (https://www.postgresql.org/docs/9.1/fuzzystrmatch.html) warns:

At present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do not work well with multibyte encodings (such as UTF-8).

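But since it does not name the levenshtein function, I was wondering whether there is a multibyte-aware version of levenshtein.

I know that I could use the unaccent function as a workaround, but I need to keep the diacritics.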

Answer 1

Note: This solution was suggested by @Nick Barnes in his answer to a related question.

The 'a' with a diacritic is a character sequence, i.e. the letter a followed by a combining character, the diacritic ̨ (U+0328): E'a\u0328'

There is an equivalent precomposed character ą: E'\u0105'

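One solution is to normalise the Unicode strings, i.e. to convert the combining character sequence into the precomposed character before comparing them.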
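The difference between the combining sequence (a followed by U+0328) and the precomposed character U+0105 can be checked outside the database. The following is a minimal sketch in plain Python 3 using only the standard `unicodedata` module; the variable names are illustrative and not part of the original answer.

```python
import unicodedata

decomposed = "a\u0328"   # 'a' followed by COMBINING OGONEK
precomposed = "\u0105"   # LATIN SMALL LETTER A WITH OGONEK

# The two strings render identically but differ in length and content.
print(len(decomposed), len(precomposed))
print(decomposed == precomposed)

# NFC composes the sequence into the single precomposed code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed, len(nfc))
```

Any function that counts code points, as an edit distance does, will therefore see the unnormalised input as two characters.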

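Unfortunately, Postgres releases before 13 have no built-in Unicode normalisation function (13 and later provide normalize()), but you can easily access one via the PL/Perl or PL/Python language extensions.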

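For example:

create extension plpythonu;

create or replace function unicode_normalize(str text) returns text as $$
  # plpythonu is Python 2: str arrives as bytes in the server encoding (assumed UTF-8 here)
  import unicodedata
  return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ language plpythonu;

Now, because unicode_normalize maps the character sequence E'a\u0328' onto the equivalent precomposed character E'\u0105', the levenshtein distance is correct:

select levenshtein(unicode_normalize(E'a\u0328'), 'x');
levenshtein 
-------------
       1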
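To see why the unnormalised input yields 2 rather than 1, here is a small pure-Python sketch of an edit distance computed over code points. It is not fuzzystrmatch's implementation; `levenshtein_codepoints` is a hypothetical helper written only for illustration, and it assumes the question's string was stored as 'a' plus a combining ogonek.

```python
import unicodedata


def levenshtein_codepoints(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over code points."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


raw = "a\u0328"                                 # combining sequence
normalised = unicodedata.normalize("NFC", raw)  # single code point U+0105

print(levenshtein_codepoints(raw, "x"))         # 2
print(levenshtein_codepoints(normalised, "x"))  # 1
```

The combining mark counts as a character of its own, so the raw string needs one substitution plus one deletion, while the normalised string needs a single substitution.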
