Question

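I have a table of book titles, most of which were published several times in different editions. Many titles were imported incorrectly and are missing their non-ASCII characters: "La métamorphose" became "La m?tamorphose", and sometimes the é became a space or was simply removed from the string.

The table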

editionid | bookid | title
--------------------------------------------
1         | 1      | Elementarne čestice
2         | 1      | Elementarne ?estice
3         | 1      | Elementarne estice
4         | 1      | Las partículas elementales
5         | 2      | Schöne neue Welt
6         | 2      | Sch ne neue Welt

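I want to identify the flawed titles by stripping the non-ASCII characters from a title and comparing the result with the other titles of the same book. If there is a match, I have found a flawed title.

Result: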

o.title (flawed)    | e.title (good)
-----------------------------------
Elementarne ?estice | Elementarne čestice
Elementarne estice  | Elementarne čestice
Sch ne neue Welt    | Schöne neue Welt

The table is rather large, but since I only need to do this once, performance is not critical.

My approach:

select distinct on (o.editionid) o.title, e.title
from editions o
inner join editions e on (o.bookid = e.bookid)
where o.bookid between 1 and 1000
    and e.title !~ '^[ -~]*$' -- only for performance
    and ((
      e.title like '%Þ%' and (o.title = regexp_replace(e.title, '[Þ]', '?') or o.title = regexp_replace(e.title, '[Þ]', ' ') or o.title = regexp_replace(e.title, '[Þ]', ''))
    ) or (
      e.title like '%ß%' and (o.title = regexp_replace(e.title, '[ß]', '?') or o.title = regexp_replace(e.title, '[ß]', ' ') or o.title = regexp_replace(e.title, '[ß]', ''))
    ) or (
      e.title like '%à%' and (o.title = regexp_replace(e.title, '[à]', '?') or o.title = regexp_replace(e.title, '[à]', ' ') or o.title = regexp_replace(e.title, '[à]', ''))
    .
    .
    .
    ))

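This works so far, but adding every non-ASCII character individually seems impractical. Does anyone know a more general approach that covers all non-ASCII characters at once?

Second: it does not work if two different characters were stripped from the same title, and I do not know how to solve that.

Third, though this may be impossible: often only some of the non-ASCII characters were converted, not all of them, e.g. "Weiße Nächte" became "Weie Nächte". It would be great if these cases could be covered as well.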

Answer 1

After some fiddling, it turned out not to be that hard:

select distinct on (o.editionid) o.title as flawed, e.title as good
from editions o
inner join editions e on (o.bookid = e.bookid)
where o.bookid between 0 and 10000
    and e.title ~ '[^\x00-\x7F]'
    and (
            o.title = regexp_replace(e.title, '[^\x00-\x7F]', '?', 'g') 
            or o.title = regexp_replace(e.title, '[^\x00-\x7F]', ' ', 'g')
        )

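The key is regexp_replace(e.title, '[^\x00-\x7F]', '?', 'g'): the class '[^\x00-\x7F]' matches every Unicode character outside the ASCII range, and the 'g' flag keeps searching for further matches in the same string, so titles with several different non-ASCII characters are handled too.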
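
The question also lists titles where the character was dropped altogether (edition 3, "Elementarne estice"), which the two replacements above would not match. Below is a minimal sketch of one way to cover that case as well, assuming a UTF-8 database with standard_conforming_strings turned on (the default). The inline sample rows are copied from the question, and the third replacement (with an empty string) is an addition that is not part of the original query:

```sql
-- Sketch only: the CTE stands in for the real editions table,
-- using the sample rows from the question.
with editions (editionid, bookid, title) as (
    values
        (1, 1, 'Elementarne čestice'),
        (2, 1, 'Elementarne ?estice'),
        (3, 1, 'Elementarne estice'),
        (4, 1, 'Las partículas elementales'),
        (5, 2, 'Schöne neue Welt'),
        (6, 2, 'Sch ne neue Welt')
)
select distinct on (o.editionid) o.title as flawed, e.title as good
from editions o
inner join editions e on (o.bookid = e.bookid)
where e.title ~ '[^\x00-\x7F]'  -- candidate "good" titles contain non-ASCII
    and o.title in (
        regexp_replace(e.title, '[^\x00-\x7F]', '?', 'g'),  -- replaced by '?'
        regexp_replace(e.title, '[^\x00-\x7F]', ' ', 'g'),  -- replaced by a space
        regexp_replace(e.title, '[^\x00-\x7F]', '',  'g')   -- removed entirely (assumed extra case)
    )
order by o.editionid;
```

On these sample rows this should return the three flawed titles from the question's expected result. It still treats all non-ASCII characters of a title the same way, so partially converted titles such as the "Weiße Nächte" example are not covered.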
