Question

PostgreSQL's to_tsvector function is very useful, but for my data set it does a bit more than I want.

For example:

select * 
from to_tsvector('english', 'This is my favourite game. I enjoy everything about it.');

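produces: 'enjoy':7 'everyth':8 'favourit':4 'game':5

I don't mind the stop words being filtered out; that's fine. But some words get completely mangled, such as everything and favourite.

Is there a way to modify this behaviour, or is there a different function that does this?

PS: Yes, I could write my own query that does this (and I have), but I'd like a faster method.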

Answer 1

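The behaviour you are seeing, and that you don't want, is "stemming". If you don't want it, you have to use a different dictionary with to_tsvector. The "simple" dictionary doesn't do stemming, so it should fit your use case.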

select * 
from to_tsvector('simple', 'This is my favourite game. I enjoy everything about it.');

produces the following output:

'about':9 'enjoy':7 'everything':8 'favourite':4 'game':5 'i':6 'is':2 'it':10 'my':3 'this':1
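
To see the stemming in isolation, you can run a single word through a dictionary with ts_lexize, or ask ts_debug which dictionary handled each token. This is only a sketch; it assumes a default installation where the built-in english_stem and simple dictionaries are present.

```sql
-- english_stem is the Snowball stemmer behind the built-in 'english' configuration
select ts_lexize('english_stem', 'everything');  -- stemmed form, as in the question's output
select ts_lexize('simple', 'everything');        -- lowercased only, no stemming

-- Show, per token, the token type, the dictionaries consulted and the resulting lexemes
select alias, token, dictionaries, lexemes
from ts_debug('english', 'This is my favourite game.');
```

Stop words show up in the ts_debug output with an empty lexemes array, which is how they end up missing from the tsvector.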

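If you still want to remove stop words, you have to define your own dictionary, as far as I can tell. See the example below, although you may want to read the documentation to make sure it does exactly what you want.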

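-- A "simple" dictionary (no stemming) that also drops the English stop words
CREATE TEXT SEARCH DICTIONARY only_stop_words (
    Template = pg_catalog.simple,
    Stopwords = english
);
-- Start from the built-in "simple" configuration...
CREATE TEXT SEARCH CONFIGURATION public.only_stop_words ( COPY = pg_catalog.simple );
-- ...and route plain ASCII words through the new dictionary (other token types keep "simple")
ALTER TEXT SEARCH CONFIGURATION public.only_stop_words ALTER MAPPING FOR asciiword WITH only_stop_words;
select * 
from to_tsvector('only_stop_words', 'The This is my favourite game. I enjoy everything about it.');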

'enjoy':8 'everything':9 'favourite':5 'game':6
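
One thing to keep in mind if you search against these vectors: the query should be built with the same configuration, otherwise the query terms get stemmed while the stored lexemes are not. A small sketch, assuming the only_stop_words configuration from above has been created:

```sql
-- Same configuration on both sides: the unstemmed lexeme 'everything' should match
select to_tsvector('only_stop_words', 'I enjoy everything about it.')
       @@ to_tsquery('only_stop_words', 'everything');

-- Mixed configurations: 'english' stems the query term, so it should no longer match
select to_tsvector('only_stop_words', 'I enjoy everything about it.')
       @@ to_tsquery('english', 'everything');
```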

Can PostgreSQL's to_tsvector function return tokens/words rather than lexemes?
