Question

My table:

CREATE TABLE `beer`.`matches` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `hashId` int(10) unsigned NOT NULL,
  `ruleId` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

If a hash matches a rule, there is an entry in this table.

1) Count how many hashIds there are for each unique ruleId (AKA "how many hashes each rule matches")

SELECT COUNT(*), ruleId FROM `beer`.`matches` GROUP BY ruleId ORDER BY COUNT(*)

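2) Select the 10 best rules (ruleIds), i.e. the 10 rules that, combined, match the largest number of unique hashes. This means a rule that matches a lot of hashes is not necessarily a good rule if another rule already covers all the same hashes. Basically, I want to select the 10 ruleIds that capture the most unique hashIds.

EDIT: Basically I have a sub-optimal solution in PHP/SQL here, but depending on the data it does not necessarily give me the best answer to question 2. I'm interested in a better solution. Read the comments for more info.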

Answer 1

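I think your problem is a variation of the "knapsack problem".

I think you already understand that you can't just take whichever ruleIds match the most hashIds, as the other answers suggest. While each of those ruleIds may match, say, 100 hashIds, they may all match the same 100 hashIds. If you had instead chosen 10 other ruleIds that each match only 25 hashIds, but with the hashIds matched by each ruleId being unique, you would end up with more unique hashIds.

To solve this, you could start by choosing whichever ruleId matches the most hashIds, then choose whichever ruleId matches the most hashIds not already matched by the previously chosen ruleIds, and continue this process until you have selected 10 ruleIds.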
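The selection procedure described above can be sketched outside SQL. The following is a minimal Python illustration, not the answerer's code: the function name `greedy_select`, the `(hashId, ruleId)` tuple layout and the tie-breaking rule (lowest ruleId wins) are assumptions made for the example, and the toy data mirrors the test data used later in this answer.

```python
from collections import defaultdict


def greedy_select(matches, k=10):
    """Pick up to k ruleIds, each time taking the rule that adds the most uncovered hashIds."""
    hashes_by_rule = defaultdict(set)
    for hash_id, rule_id in matches:
        hashes_by_rule[rule_id].add(hash_id)

    selected, covered = [], set()
    for _ in range(k):
        best_rule, best_gain = None, 0
        # Iterate in ruleId order so ties are broken deterministically (assumption).
        for rule_id in sorted(hashes_by_rule):
            gain = len(hashes_by_rule[rule_id] - covered)
            if gain > best_gain:
                best_rule, best_gain = rule_id, gain
        if best_rule is None:  # no remaining rule adds any new hashId
            break
        selected.append(best_rule)
        covered |= hashes_by_rule[best_rule]
    return selected, covered


# Toy data: rules 1-9 all match hashes 1-10, rules 10-11 both match 11-15,
# and rules 12-20 each match 5 hashes of their own.
matches = [(h, r) for r in range(1, 10) for h in range(1, 11)]
matches += [(h, r) for r in (10, 11) for h in range(11, 16)]
matches += [(h, r) for r in range(12, 21) for h in range(5 * r - 44, 5 * r - 39)]

rules, covered = greedy_select(matches, k=10)
print(rules, len(covered))  # [1, 10, 12, 13, 14, 15, 16, 17, 18, 19] 55
```

Picking the 10 rules with the highest raw counts would select rules 1 to 9 plus one more and cover only 15 hashes, which is the pitfall described above.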

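There could still be anomalies in your data distribution that cause this not to produce an optimal set of ruleIds. So if you want to go crazy, you could consider implementing a genetic algorithm to try to improve the "fitness" of your set of 10 ruleIds.

This is not a task that SQL is especially well suited to handle, but here's an example of the knapsack problem being solved with a genetic algorithm written in SQL(!)

EDIT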

Here's an untested implementation of the solution, in which ruleIds are selected one at a time; each iteration selects the ruleId with the most unique hashIds that are not already covered by any previously selected ruleId:

-- Create test data
create table matches (
  id int(10) unsigned not null auto_increment,
  hashId int(10) unsigned not null,
  ruleId int(10) unsigned not null,
  primary key (id)
);

insert into matches (hashid, ruleid) values
  (1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1), (9,1), (10,1),
  (1,2), (2,2), (3,2), (4,2), (5,2), (6,2), (7,2), (8,2), (9,2), (10,2),
  (1,3), (2,3), (3,3), (4,3), (5,3), (6,3), (7,3), (8,3), (9,3), (10,3),
  (1,4), (2,4), (3,4), (4,4), (5,4), (6,4), (7,4), (8,4), (9,4), (10,4),
  (1,5), (2,5), (3,5), (4,5), (5,5), (6,5), (7,5), (8,5), (9,5), (10,5),
  (1,6), (2,6), (3,6), (4,6), (5,6), (6,6), (7,6), (8,6), (9,6), (10,6),
  (1,7), (2,7), (3,7), (4,7), (5,7), (6,7), (7,7), (8,7), (9,7), (10,7),
  (1,8), (2,8), (3,8), (4,8), (5,8), (6,8), (7,8), (8,8), (9,8), (10,8),
  (1,9), (2,9), (3,9), (4,9), (5,9), (6,9), (7,9), (8,9), (9,9), (10,9),
  (11,10), (12,10), (13,10), (14,10), (15,10),
  (11,11), (12,11), (13,11), (14,11), (15,11),
  (16,12), (17,12), (18,12), (19,12), (20,12),
  (21,13), (22,13), (23,13), (24,13), (25,13),
  (26,14), (27,14), (28,14), (29,14), (30,14),
  (31,15), (32,15), (33,15), (34,15), (35,15),
  (36,16), (37,16), (38,16), (39,16), (40,16),
  (41,17), (42,17), (43,17), (44,17), (45,17),
  (46,18), (47,18), (48,18), (49,18), (50,18),
  (51,19), (52,19), (53,19), (54,19), (55,19),
  (56,20), (57,20), (58,20), (59,20), (60,20);
-- End create test data

create table selectedRules (
  ruleId int(10) unsigned not null
);

-- In MySQL, WHILE is only valid inside a stored program, so the loop is
-- wrapped in a procedure (the name selectBestRules is arbitrary).
delimiter $$
create procedure selectBestRules()
begin
  set @rulesSelected = 0;
  while (@rulesSelected < 10) do
    -- pick the rule with the most hashIds not yet covered by selected rules
    insert into selectedRules (ruleId)
    select m.ruleId
    from matches m
    left join (
      select distinct m2.hashId
      from selectedRules sr
      join matches m2 on m2.ruleId = sr.ruleId
    ) prev on prev.hashId = m.hashId
    where prev.hashId is null
    group by m.ruleId
    order by count(distinct m.hashId) desc
    limit 1;
    set @rulesSelected = @rulesSelected + 1;
  end while;
end$$
delimiter ;

call selectBestRules();
select ruleId from selectedRules;

Answer 2

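If you really want to find the best (optimal) solution, the problem is that you have to check every possible combination of 10 ruleIds and find out how many hashIds each of those combinations returns. The trouble is that the number of combinations is roughly (number of distinct rules)^10. In fact the number is smaller, because you can't repeat the same ruleId within a combination: it is the number of combinations of m elements taken in groups of 10.

Note: to be exact, the number of possible combinations is

m!/(n!(m-n)!) => m!/(10!(m-10)!) where ! is the factorial: m! = m * (m-1) * (m-2) * ... * 3 * 2 * 1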
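To get a feel for the sizes involved, the formula above can be evaluated directly. This is a small Python sketch; the helper name `combination_count` and the choice of m = 20 distinct rules are assumptions made for illustration only.

```python
import math


def combination_count(m, n=10):
    """Number of ways to choose n distinct rules out of m: m! / (n! * (m - n)!)."""
    return math.factorial(m) // (math.factorial(n) * math.factorial(m - n))


m = 20  # hypothetical number of distinct ruleIds
print(combination_count(m))  # 184756
print(m ** 10)               # 10240000000000, the rough upper estimate (rules)^10
print(combination_count(m) == math.comb(m, 10))  # True, math.comb needs Python 3.8+
```

Even though the exact count is far below the rough estimate, it still grows very quickly with the number of distinct rules.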

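To build these combinations you have to join the table with itself 10 times, excluding the previous combinations of rules, somewhat like this:

select distinct m1.ruleid r1, m2.ruleid r2, m3.ruleid r3 ...
from matches m1
   inner join matches m2 on m2.ruleid <> m1.ruleid
   inner join matches m3 on m3.ruleid <> m1.ruleid and m3.ruleid <> m2.ruleid
     ...

Then you have to find the highest count from:

select r1, r2, r3..., count(distinct M.hashid)
from ("here the combinations of 10 ruleIds defined above") G10
inner join matches M
  on M.ruleid = r1 or M.ruleid = r2 or M.ruleid = r3...
group by r1, r2, r3...

This gigantic query would take a very long time to run.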
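The exhaustive search described above is easier to see in a general-purpose language. The following Python sketch is only an illustration of the idea, not a translation of the queries: the function name `best_combination` and the three-rule toy data are assumptions, and each combination is generated once rather than in every order.

```python
from collections import defaultdict
from itertools import combinations


def best_combination(matches, k):
    """Try every set of k distinct ruleIds and keep the one covering the most hashIds."""
    hashes_by_rule = defaultdict(set)
    for hash_id, rule_id in matches:
        hashes_by_rule[rule_id].add(hash_id)

    best_rules, best_count = (), 0
    for combo in combinations(sorted(hashes_by_rule), k):
        count = len(set().union(*(hashes_by_rule[r] for r in combo)))
        if count > best_count:
            best_rules, best_count = combo, count
    return best_rules, best_count


# Hypothetical data as (hashId, ruleId): rule 1 -> {1,2,3,4}, rule 2 -> {1,2,5}, rule 3 -> {3,4,6}
matches = [(1, 1), (2, 1), (3, 1), (4, 1),
           (1, 2), (2, 2), (5, 2),
           (3, 3), (4, 3), (6, 3)]
print(best_combination(matches, 2))  # ((2, 3), 6)
```

The loop body is cheap, but the number of iterations is the combination count from the note above, which is what makes this approach impractical for larger rule sets.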

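There may be faster procedures that give you sub-optimal results.

Some optimization:

This can be optimized to some extent, depending on the shape of the data, by looking for groups that are equal to or contained in other groups. That takes fewer than (m*(m+1))/2 operations, which is small compared with the other number, and it pays off especially if you are likely to find several groups that can be discarded, since that lowers m. Either way, the main part still has a huge cost.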
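A sketch of that pruning step, in Python and under stated assumptions: `prune_dominated` is a hypothetical helper, rules are represented as sets of hashIds, and when two rules match exactly the same hashes the one with the lower ruleId is kept. The toy data has the same shape as the test data in Answer 1.

```python
from collections import defaultdict
from math import comb


def prune_dominated(hashes_by_rule):
    """Drop every rule whose hash set equals or is contained in an already kept rule's set."""
    kept = {}
    # Visit larger sets first so a superset is always considered before its subsets.
    for rule_id in sorted(hashes_by_rule, key=lambda r: (-len(hashes_by_rule[r]), r)):
        hashes = hashes_by_rule[rule_id]
        if not any(hashes <= other for other in kept.values()):
            kept[rule_id] = hashes
    return kept


# Rules 1-9 share hashes 1-10, rules 10-11 share 11-15, rules 12-20 are disjoint.
matches = [(h, r) for r in range(1, 10) for h in range(1, 11)]
matches += [(h, r) for r in (10, 11) for h in range(11, 16)]
matches += [(h, r) for r in range(12, 21) for h in range(5 * r - 44, 5 * r - 39)]

hashes_by_rule = defaultdict(set)
for hash_id, rule_id in matches:
    hashes_by_rule[rule_id].add(hash_id)

kept = prune_dominated(hashes_by_rule)
print(sorted(kept))  # [1, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20]
print(comb(len(hashes_by_rule), 10), "->", comb(len(kept), 10))  # 184756 -> 11
```

On this data the search space shrinks from 184756 combinations to 11. If pruning leaves fewer rules than needed, any discarded rule can fill the remaining slots without changing the coverage.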

Answer 3

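Although I come from the PostgreSQL world, I found this question really interesting and took the time to look into it.

I split the whole process into two subroutines:

1) First, a subquery (or function) is needed that, for a given combination (array) of ruleIds, returns all possible (array) + ruleId entries together with the number (count) of unique hashIds found for each entry;
2) Then, one should query max(count) from #1 and get the list of array + ruleId combinations from #1 that reach it. I used a recursive function for this. If the current recursion level matches the desired number of ruleIds (10 in the question), the found array + ruleId combinations are returned; otherwise it recurses into the same step (#2), passing the found combination as input.

So the second function returns all combinations that yield the maximal number of unique hashIds for the given ruleId count.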
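Before the PostgreSQL code, here is a rough Python paraphrase of the two subroutines. It is a sketch under assumptions, not a port: rules are held in a dict of sets, already chosen rules are skipped when extending a combination, and duplicate combinations in a different order are removed with a `frozenset` key. The function names simply mirror the ones used in this answer.

```python
def count_matches(hashes_by_rule, chosen):
    """Step #1: for each rule not yet chosen, return (chosen + [rule], distinct hash count)."""
    out = []
    for rule_id in sorted(hashes_by_rule):
        if rule_id in chosen:
            continue
        rules = chosen + [rule_id]
        covered = set().union(*(hashes_by_rule[r] for r in rules))
        out.append((rules, len(covered)))
    return out


def max_matches(hashes_by_rule, k, chosen=None, level=1):
    """Step #2: follow every extension that reaches the maximal count at this level."""
    chosen = chosen or []
    candidates = count_matches(hashes_by_rule, chosen)
    if not candidates:  # fewer rules than requested
        return [chosen] if chosen else []
    best = max(cnt for _, cnt in candidates)

    results, seen = [], set()
    for rules, cnt in candidates:
        if cnt != best:
            continue
        found = [rules] if level == k else max_matches(hashes_by_rule, k, rules, level + 1)
        for combo in found:
            key = frozenset(combo)  # {6, 14} and {14, 6} count as the same combination
            if key not in seen:
                seen.add(key)
                results.append(combo)
    return results


# Hypothetical data: ruleId -> set of hashIds
rules = {1: {1, 2, 3}, 2: {4, 5}, 3: {6, 7}, 4: {1, 2}}
print(max_matches(rules, 2))  # [[1, 2], [1, 3]]
```

Because only the best extensions are followed at each level, this explores ties exhaustively but is not a full search over all combinations.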

Here is the code that creates the test setup, tested on PostgreSQL 9.1. Since the original question is about MySQL, I'll comment on what is going on there:

create table matches (
  id        int4   not null,
  hashId    int4   not null,
  ruleId    int4   not null,
  primary key (id)
);

insert into matches
SELECT generate_series(1,200), (random()*59+1)::int4, (random()*19+1)::int4;
-- This query will generate a 200-rows table, with:
-- - first column having values in 1-200 range (id)
-- - second column will have random numbers in 1-60 range (hashId)
-- - third column will have random numbers in 1-20 range (ruleId)

The function for phase 1 (quite simple):

CREATE OR REPLACE FUNCTION count_matches(i_array int4[],
    OUT arr int4[], OUT cnt int4) RETURNS SETOF record
AS $$
DECLARE
    rec_o   record;
    rec_i   record;

BEGIN
    -- in the outer loop, we're going over all the combinations of input array
    -- with the ruleId appended
    FOR rec_o IN SELECT DISTINCT i_array||ruleId AS rules
        FROM matches ORDER BY 1
    LOOP
        -- in the inner loop we're counting the distinct hashId combinations
        -- for the outer loop provided array
        -- and returning the new array + count
        FOR rec_i IN SELECT count(distinct hashId) AS cnt
            FROM matches WHERE ruleId = ANY(rec_o.rules)
        LOOP

            arr := rec_o.rules;
            cnt := rec_i.cnt;
            RETURN NEXT ;
        END LOOP;
    END LOOP;

    RETURN ;
END;
$$ LANGUAGE plpgsql;

If you pass an empty array as input to this function, you get the same results as for case #1 of the initial question:

SELECT COUNT(*), ruleId FROM `beer`.`matches` GROUP BY ruleId ORDER BY COUNT(*);
-- both queries yield the same results
SELECT cnt, arr FROM count_matches(ARRAY[]::int4[]);

Now the main working function:

-- function receives 3 parameters, 2 of them have default values
-- which makes it possible to query: max_matches(10)
-- to obtain results from the initial question
CREATE OR REPLACE FUNCTION max_matches(maxi int4,
    arri int4[] DEFAULT array[]::int4[],
    curi int4 DEFAULT 1, OUT arr int4[]) RETURNS SETOF int4[]
AS $$
DECLARE
    maxcnt  int4;
    a       int4[];
    b       int4[];

BEGIN
    -- Fall out early for "easy" cases
    IF maxi < 2 THEN
        RAISE EXCEPTION 'Too easy, do a GROUP BY query instead';
    END IF;

    a = array[]::int4[];

    -- first, we find out what is the maximal possible number of hashIds
    -- on a given level
    SELECT max(cnt) INTO maxcnt FROM count_matches(arri);
    -- then we check each combination that yield the found number
    -- of unique hashIds
    FOR arr IN SELECT cm.arr FROM count_matches(arri) cm
       WHERE cm.cnt = maxcnt
    LOOP
        -- if we're on the deepest level of recursion,
        -- we just return back the found combination
        IF curi = maxi THEN
            RETURN NEXT ;
        ELSE
            -- otherwise we ask further down
            FOR b IN SELECT * FROM max_matches(maxi, arr, curi+1) LOOP
            -- this loop and IF clause are required to eliminate
            -- equal arrays, so that if we get {6,14} and {14,6} returned
            -- we will use only one of the two, as they're the same
                IF NOT a @> b THEN
                    a = array_cat(a, b);
                    RETURN QUERY SELECT b;
                END IF;
            END LOOP;
        END IF;
    END LOOP;

    RETURN ;
END;
$$ LANGUAGE plpgsql;

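Unfortunately, this approach is quite time-consuming. On my test setup I got the following performance; spending 8 seconds on a 200-row "big" table seems a bit excessive.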

select * from max_matches(10);
             arr
-----------------------------
 {6,14,4,16,8,1,7,10,11,18}
 {6,14,4,16,8,1,7,11,12,18}
 {6,14,4,16,8,7,10,11,15,18}
 {6,14,4,16,11,10,1,7,18,20}
(4 rows)

Time: 8034,700 ms

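I hope you don't mind me jumping into this question. I also hope you'll find my answer at least partially useful for your purposes :)

Thanks for the question, I had a very good time with it!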

Answer 4

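I think the approach that would work best here is based on the same logic as the statistical technique of multivariate cofactor analysis.

That is, instead of trying to solve the inherently combinatorial problem "which combination of 10 factors (or 'rules', in your problem) out of the existing rules best fulfills some criterion?", it incrementally answers the much easier question "given what I already have, which additional factor ('rule') best improves how well the criterion is fulfilled?"

Procedurally, it goes like this: first, find the rule that has the most (distinct) hashes matching it. Don't worry about overlap with other rules, just find the single best one. Add it to a list (or table) of already selected rules.

Now, given the rules you have already selected, find the next-best rule. In other words, find the rule that matches the most hashes, excluding any hashes already matched by the already selected rules. Add this new rule to your list of already selected rules and repeat until you have 10 rules.

So this approach basically avoids the inherently combinatorial problem of finding the absolute, globally best solution by finding the incremental, relative/local best solution instead. Some points about this approach: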

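- It is O(n*k), where "k" is the number of rules you are looking for. Combinatorial approaches tend to be non-polynomial, like O(2^n) or O(n!), which is highly undesirable performance-wise.
- This approach may not give you the absolute *best* 10 rules for your criterion. However, in my experience, in real-world cases of this kind of problem it tends to do very well. Usually it is off by at most one or two rules from the absolute best 10.
- The SQL code for the incremental search is very easy (you already have most of it). But the SQL code to actually run it N=10 times is inherently procedural and therefore requires the less standard, more idiosyncratic parts of SQL (translation: I know how to do it in T-SQL, but not in MySQL).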
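The caveat that the incremental search may miss the absolute best set is easy to reproduce on a tiny case. The Python sketch below uses made-up data and hypothetical helper names (`coverage`, `incremental`, `exhaustive`); it compares the incremental choice with an exhaustive check for k = 2.

```python
from itertools import combinations

# Hypothetical data: ruleId -> set of hashIds it matches
rules = {1: {1, 2, 3, 4}, 2: {1, 2, 5}, 3: {3, 4, 6}}


def coverage(rule_ids):
    """Distinct hashIds matched by the given rules."""
    return set().union(*(rules[r] for r in rule_ids))


def incremental(k):
    """Add, k times, the rule that contributes the most not-yet-covered hashIds."""
    chosen = []
    for _ in range(k):
        covered = coverage(chosen)
        remaining = [r for r in sorted(rules) if r not in chosen]
        if not remaining:
            break
        chosen.append(max(remaining, key=lambda r: len(rules[r] - covered)))
    return chosen


def exhaustive(k):
    """Check every combination of k rules (only feasible for tiny inputs)."""
    return max(combinations(sorted(rules), k), key=lambda c: len(coverage(c)))


print(incremental(2), len(coverage(incremental(2))))  # [1, 2] 5
print(exhaustive(2), len(coverage(exhaustive(2))))    # (2, 3) 6
```

Rule 1 looks best on its own, so the incremental search commits to it and ends with 5 hashes, while rules 2 and 3 together cover 6. For this kind of maximum-coverage selection the incremental strategy is known to reach at least about 63% (1 - 1/e) of the optimal coverage, which fits the observation that it is usually close in practice.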

Answer 5

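Here is a solution that may be good enough. Indexes and/or manually created cache tables may help performance, although on a sparsely populated table it runs instantly.

The idea is very simple: create a view that explicitly lists all the possibilities, then combine them and find the best one by ordering. Combinations that repeat the same rule are allowed, because a single rule on its own may be more effective than a combination of others.

Based on a table similar to the one described above, with columns named "id", "hash_id" and "rule_id", create a helper view (this makes it easier to test/debug) using the following select: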

SELECT `t1`.`hash_id` AS `h1`,`t2`.`hash_id` AS `h2`,`t3`.`hash_id` AS `h3`,`t1`.`rule_id` AS `r1`,`t2`.`rule_id` AS `r2`,`t3`.`rule_id` AS `r3` from (`hashTable` `t1` join `hashTable` `t2` join `hashTable` `t3`)

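The view above is set up to create a triple-join table. You can add t4.hash_id as h4, t4.rule_id as r4 to the SELECT and join hashTable t4 to the FROM to add a fourth join, and so on up to 10.

After creating the view, the following query gives the best combinations of 2 rules with their hash coverage shown explicitly: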

select group_concat(distinct h1),concat(r1, r2) from (select distinct h1,r1,r2 from hashView union distinct select distinct h2,r1,r2 from hashView) as uu group by concat(r1,r2)

If you don't need to see the hash coverage, this may be better:

select count(distinct h1) as cc,concat(r1, r2) from (select distinct h1,r1,r2 from hashView union distinct select distinct h2,r1,r2 from hashView) as uu group by concat(r1,r2) order by cc

Adding a third rule match is simple: add h3 and r3 to the union and group by them:

select count(distinct h1),concat(r1, r2, r3) from (select distinct h1,r1,r2,r3 from hashView union distinct select distinct h2,r1,r2,r3 from hashView union distinct select distinct h3,r1,r2,r3 from hashView) as uu group by concat(r1,r2,r3)

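If you don't need the option of choosing how many top rules to match, you can do the concat() in the view itself and save some time on the union queries.

A possible performance gain is to eliminate permuted rule IDs.

All of the above was tested only with single-digit rule IDs, so you should use concat_ws() instead of concat(). For the pre-concatenated view, that looks like this: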

select `t1`.`hash_id` AS `h1`,`t2`.`hash_id` AS `h2`,`t3`.`hash_id` AS `h3`,concat_ws(",",`t1`.`rule_id`,`t2`.`rule_id`,`t3`.`rule_id`) AS `r` from (`hashTable` `t1` join `hashTable` `t2` join `hashTable` `t3`)

And then the union query:

select count(distinct h1) as cc,r from (select distinct h1,r from hashView union distinct select distinct h2,r from hashView union distinct select distinct h3,r from hashView) as uu group by r order by cc
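Two of the points above, the switch from concat() to concat_ws() and the elimination of permuted rule IDs, come down to how the grouping key is built. The Python sketch below only mimics the string behaviour of those SQL functions; `key_concat`, `key_concat_ws` and `canonical_key` are hypothetical helpers, not part of the queries.

```python
def key_concat(rule_ids):
    """Mimics concat(r1, r2, ...): IDs are glued together with no separator."""
    return "".join(str(r) for r in rule_ids)


def key_concat_ws(rule_ids, sep=","):
    """Mimics concat_ws(',', r1, r2, ...): IDs are joined with a separator."""
    return sep.join(str(r) for r in rule_ids)


def canonical_key(rule_ids):
    """One key per set of rules: sorting and de-duplicating removes permutations."""
    return ",".join(str(r) for r in sorted(set(rule_ids)))


# Without a separator, different multi-digit combinations collapse into one group.
print(key_concat([1, 12]), key_concat([11, 2]))        # 112 112
print(key_concat_ws([1, 12]), key_concat_ws([11, 2]))  # 1,12 11,2

# Permutations of the same rules still produce different keys unless normalised.
print(key_concat_ws([3, 1, 2]), key_concat_ws([2, 3, 1]))  # 3,1,2 2,3,1
print(canonical_key([3, 1, 2]), canonical_key([2, 3, 1]))  # 1,2,3 1,2,3
```

In SQL, a comparable effect for permutations could be obtained by only joining rows whose rule IDs are in non-decreasing order, which also shrinks the view.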

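Please let me know whether this solves the problem at hand, or whether there are other constraints that were not disclosed before.

Depending on the number of rules and hashes, you could also reverse the rule <-> hash relationship and create a hash-based view instead.

The best idea is probably to combine this approach with real-life heuristics.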

SQL: select the set of 10 records that best meets a criterion
