Question

I am trying to get a list of the best-matching items for a given list of tags. With the following data:

DROP TABLE IF EXISTS testing_items;
CREATE TEMP TABLE testing_items(
    id bigserial primary key,
    tags text[]
);
CREATE INDEX ON testing_items using gin (tags);

INSERT INTO testing_items (tags) VALUES ('{123,456, abc}');
INSERT INTO testing_items (tags) VALUES ('{222,333}');
INSERT INTO testing_items (tags) VALUES ('{222,555}');
INSERT INTO testing_items (tags) VALUES ('{222,123}');
INSERT INTO testing_items (tags) VALUES ('{222,123,555,666}');

I have the tags 222, 555 and 666. How can I get a list like this?

PS: It must use the GIN index, because there will be a large number of records.

id         matches
--         -------
5          3
3          2
2          1
4          1

Edit: id 1 should not appear in the list, because it does not match any tag:

1          0

Answer 1

Unnest the tags, filter out the elements that are not wanted, and aggregate the remaining ones:

select id, count(distinct u) as matches
from (
    select id, u
    from testing_items,
    lateral unnest(tags) u
    where u in ('222', '555', '666')
    ) s
group by 1
order by 2 desc

 id | matches 
----+---------
  5 |       3
  3 |       2
  2 |       1
  4 |       1
(4 rows)

Considering all the answers, it seems that this query combines the good points of each of them:

select id, count(*) 
from testing_items,
unnest(array['11','5','8']) u
where tags @> array[u] 
group by id 
order by 2 desc, 1;

It performs best in Eduardo's test.
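As a small sketch of how the query above could be reused for the tags from the question, it can be wrapped in a prepared statement that takes the tag list as a single array parameter. The statement name tag_matches and the column alias matches are made up for illustration; the query body is the one shown above. Note that a tag repeated in the parameter array would be counted once per repetition.

```sql
-- Hypothetical prepared statement; the name tag_matches is an assumption.
prepare tag_matches(text[]) as
select id, count(*) as matches
from testing_items,
     unnest($1) as u            -- one row per requested tag
where tags @> array[u]          -- item contains this tag
group by id
order by matches desc, id;      -- id breaks ties deterministically

-- Tags from the question.
execute tag_matches(array['222', '555', '666']);
-- With the five sample rows this returns (5,3), (3,2), (2,1), (4,1).
```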

Answer 2

Here are my two cents, using unnest and the array "contains" operator:

select id, count(*) 
from (
  select unnest(array['222','555','666']) as tag, * 
  from testing_items
) as w 
where tags @> array[tag] 
group by id 
order by 2 desc

Result:

+------+---------+
|   id |   count |
|------+---------|
|    5 |       3 |
|    3 |       2 |
|    2 |       1 |
|    4 |       1 |
+------+---------+

Answer 3

This is how I tested with 10 million records, each with 3 tags, each tag being a random number between 0 and 100:

BEGIN;
LOCK TABLE testing_items IN EXCLUSIVE MODE;
INSERT INTO testing_items (tags)
SELECT (ARRAY[trunc(random() * 99 + 1), trunc(random() * 99 + 1), trunc(random() * 99 + 1)])
FROM generate_series(1, 10000000) s;
COMMIT;

I did not wait for large responses.

The solutions from @paqash and @klin have similar performance. On my laptop they run in 12 seconds with the tags 11, 8 and 5.

But this one finishes in 4.6 seconds:

ORDER BY c DESC, id LIMIT 5

But I still think there is a faster way.
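One way to investigate timings like these is to look at the execution plan. The sketch below assumes testing_items has been filled as in the benchmark above and uses the unnest plus @> query form from this thread with the tags 11, 5 and 8; the alias c is only illustrative. A "Bitmap Index Scan" node on the GIN index in the output would indicate the index is being used, while a "Seq Scan" node on testing_items would indicate it is not.

```sql
-- Temporary tables are not processed by autovacuum,
-- so refresh the planner statistics by hand after the bulk insert.
analyze testing_items;

-- Show the actual plan, per-node timings and buffer usage.
explain (analyze, buffers)
select id, count(*) as c
from testing_items,
     unnest(array['11', '5', '8']) as u
where tags @> array[u]
group by id
order by c desc, id
limit 5;
```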

Answer 4

Check it here: http://rextester.com/UTGO74511

If you are using a GIN index, use &&:

select *
from testing_items
where not (ARRAY['333','555','666'] && tags);


 id | tags
----+---------------
  1 | {123,456,abc}
  4 | {222,123}
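The query above returns the rows that do not overlap with the given tags. As a hedged sketch of how && might be applied to the original question instead, the overlap test can act as a GIN-indexable prefilter, with the match count computed only for the rows that pass it. The aliases t, u, tag and matches are assumptions for illustration and do not come from the answers above.

```sql
select t.id, count(distinct u.tag) as matches
from testing_items t
cross join lateral unnest(t.tags) as u(tag)
where t.tags && array['222', '555', '666']        -- overlap: at least one tag in common
  and u.tag = any (array['222', '555', '666'])    -- keep only the matching tags
group by t.id
order by matches desc, t.id;
-- With the five sample rows this returns (5,3), (3,2), (2,1), (4,1).
```

Rows with no tag in common, such as id 1, are dropped by the && condition before any unnesting happens.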

PostgreSQL - order by number of matches in an array

4 answers: