Multilingual full-text search with PostgreSQL

Time: 2019-07-02 10:40:26

Tags: postgresql search full-text-search multilingual

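The user input may be in English or Italian. The data is in both English and (mostly) Italian. Below is my query, which appears to work; my question is whether this is the right way to handle input in an unknown language. (In the example, the user enters the word 'wine'.)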

    SELECT id, name
    FROM (
        SELECT p.id, p.name,
               to_tsvector('italian', p.name) || -- some data are only in Italian
               to_tsvector('italian', cat.category) ||
               -- pick the configuration matching each description's language;
               -- coalesce to '' so a NULL aggregate does not null out the whole document
               to_tsvector((CASE WHEN de.language = 'ITA' THEN 'italian' ELSE 'english' END)::regconfig,
                           coalesce(string_agg(de.descr, ' '), '')) AS document
        FROM myschema.product p
        INNER JOIN myschema.disc d ON d.id_disc = p.id_disc
        INNER JOIN myschema.disc_city dc ON dc.id_disc = d.id_disc
        INNER JOIN myschema.city c ON c.id_city = dc.id_city
        INNER JOIN myschema.category cat ON cat.id_category = d.id_category
        INNER JOIN myschema.product_desc pd ON pd.id = p.id -- one p.id to many pd.id: a product can have multiple descriptions
        INNER JOIN myschema.descr de ON de.id_descr = pd.id_descr
        GROUP BY p.id, p.name, cat.category, de.language
    ) p_search
    -- handle input 'wine' of unknown language (it could also be the Italian 'vino')
    WHERE p_search.document @@ to_tsquery('italian', 'wine') OR
          p_search.document @@ to_tsquery('english', 'wine')
    GROUP BY id, name;
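
To see what each branch of the WHERE clause actually searches for, the normalized form of the input under both configurations can be inspected. This is only a diagnostic sketch: it assumes the built-in italian and english configurations, and the two input words are the example values mentioned above.

```sql
-- Diagnostic sketch: show how each configuration normalizes the same user input.
-- The two query columns can differ because each configuration applies
-- its own stemmer and stop-word list.
SELECT input,
       to_tsquery('italian', input) AS italian_query,
       to_tsquery('english', input) AS english_query
FROM (VALUES ('wine'), ('vino')) AS t(input);
```

If the two columns differ for a given word, the OR in the query above is what lets either normalized form match the document.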

2 answers:

Answer 0: (score: 0)

You can test with the 'simple' configuration:

    SELECT to_tsvector('english', 'The wine is good');
    SELECT to_tsvector('italian', 'The wine is good');
    SELECT to_tsvector('simple', 'The wine is good');

    SELECT to_tsvector('english', 'Il vino è buono');
    SELECT to_tsvector('italian', 'Il vino è buono');
    SELECT to_tsvector('simple', 'Il vino è buono');
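
The trade-off of 'simple' can be checked with a match test. The sketch below assumes the stock 'simple' configuration, which lowercases tokens but applies neither stemming nor a stop-word list; the sentence 'The wines are good' is a made-up example.

```sql
-- Sketch: 'simple' matches exact (lowercased) words in any language,
-- but does not relate inflected forms such as 'wines' and 'wine'.
SELECT to_tsvector('simple', 'Il vino è buono')    @@ to_tsquery('simple', 'vino') AS exact_word_match,
       to_tsvector('simple', 'The wines are good') @@ to_tsquery('simple', 'wine') AS plural_vs_singular;
```

With the stock configuration, the first column should be true and the second false, so 'simple' is language-neutral at the cost of losing stemming.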

Answer 1: (score: 0)

With PostgreSQL, you can create your own dictionary:

    CREATE TEXT SEARCH DICTIONARY public.wine_dict (
        TEMPLATE = pg_catalog.simple,
        STOPWORDS = wine
    );

The file wine.stop contains the dictionary's stop words:

    wine
    merlot
    carmenere
    ...

This file must be located at $SHAREDIR/tsearch_data/wine.stop. Use pg_config --sharedir to find $SHAREDIR.

Then create the text search dictionary and a configuration that uses it:

    CREATE TEXT SEARCH DICTIONARY public.wine_dict (
        TEMPLATE = pg_catalog.simple,
        STOPWORDS = wine
    );

    CREATE TEXT SEARCH CONFIGURATION wine_dict (PARSER = default);

    ALTER TEXT SEARCH CONFIGURATION wine_dict
        ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                          word, hword, hword_part
        WITH wine_dict;

    SELECT to_tsvector('wine_dict', 'The wine is good');

Result:

    'good':4 'is':3 'the':1
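
The dictionary can also be probed on its own, without going through a configuration. The following is a minimal sketch that assumes public.wine_dict was created as above and that wine.stop is installed; ts_lexize is the built-in function for testing a single token against a dictionary.

```sql
-- Sketch: test individual tokens against the custom dictionary.
-- An empty array means the token was recognized as a stop word;
-- any other token is returned lowercased by the simple template.
SELECT ts_lexize('public.wine_dict', 'wine') AS stop_word,
       ts_lexize('public.wine_dict', 'Good') AS regular_word;
```

This is a quick way to confirm that the stop-word file was picked up before relying on the configuration in queries.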