Find the top N products that other visitors also viewed (in MySQL)

Date: 2016-03-17 21:28:49

Tags: mysql sql

Background

I have a product_visits table that looks like this:

create table product_visits (product_id int, visitor_id int);

insert into product_visits values
  (1, 1),
  (1, 2),
  (1, 3),
  (1, 4),
  (1, 5),
  (2, 1),
  (2, 2),
  (2, 3),
  (2, 4),
  (2, 5),
  (3, 1),
  (3, 2),
  (3, 3),
  (4, 1),
  (4, 2),
  (5, 1);

| product_id | visitor_id |
|------------|------------|
|          1 |          1 |
|          1 |          2 |
|          1 |          3 |
|          1 |          4 |
|          1 |          5 |
|          2 |          1 |
|          2 |          2 |
|          2 |          3 |
|          2 |          4 |
|          2 |          5 |
|          3 |          1 |
|          3 |          2 |
|          3 |          3 |
|          4 |          1 |
|          4 |          2 |
|          5 |          1 |

I currently use the following SQL to select the top 2 other products that were also visited by the visitors of a given product:

SELECT a.`product_id`, count(a.`product_id`) visits
FROM `product_visits` a
INNER JOIN `product_visits` b ON a.`visitor_id` = b.`visitor_id`
WHERE b.`product_id` = ?
  AND a.`product_id` != ?
GROUP BY a.`product_id`
ORDER BY visits DESC 
LIMIT 2

For example, running it for product_id = 1 against the data above gives the following result:

| product_id | visits |
|------------|--------|
|          2 |      5 |
|          3 |      3 |

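This works fine when fetching the results for one product at a time.

Problem

What I need is to rewrite the query above so that a single query covers every product in the product_visits table. I still need to limit the results to the top n (e.g. 2) related products per product. For example, given the data above, I would like to see the following result: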

| target_product_id | related_product_id | visits |
|-------------------|--------------------|--------|
|                 1 |                  2 |      5 |
|                 1 |                  3 |      3 |
|                 2 |                  1 |      5 |
|                 2 |                  3 |      3 |
|                 3 |                  1 |      3 |
|                 3 |                  2 |      3 |
|                 4 |                  1 |      2 |
|                 4 |                  2 |      2 |
|                 5 |                  1 |      1 |
|                 5 |                  2 |      1 |

The closest I have come to this is the following:

SELECT a.`product_id` target_product_id, b.`product_id` related_product_id, count(*) visits
FROM `product_visits` a
INNER JOIN `product_visits` b ON a.`visitor_id` = b.`visitor_id`
WHERE b.`product_id` != a.`product_id`
GROUP BY a.`product_id`, b.`product_id`
ORDER BY target_product_id ASC, visits DESC

It gives me the result below, but it still lacks the limit to the top n matches per target_product_id:

| target_product_id | related_product_id | visits |
|-------------------|--------------------|--------|
|                 1 |                  2 |      5 |
|                 1 |                  3 |      3 |
|                 1 |                  4 |      2 |
|                 1 |                  5 |      1 |
|                 2 |                  1 |      5 |
|                 2 |                  3 |      3 |
|                 2 |                  4 |      2 |
|                 2 |                  5 |      1 |
|                 3 |                  1 |      3 |
|                 3 |                  2 |      3 |
|                 3 |                  4 |      2 |
|                 3 |                  5 |      1 |
|                 4 |                  3 |      2 |
|                 4 |                  1 |      2 |
|                 4 |                  2 |      2 |
|                 4 |                  5 |      1 |
|                 5 |                  3 |      1 |
|                 5 |                  1 |      1 |
|                 5 |                  4 |      1 |
|                 5 |                  2 |      1 |

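I have been banging my head against this for a while now, but I have not been able to come up with a complete solution.

Update #1

I ran the SQL suggested by Gordon Linoff below against my production data (in a development database, of course). My product_visits table has about 2.6 million rows. With the limit set to 2, the query took 41.8572 seconds. Almost all of that time (40.4 seconds) was spent in the "Copying to tmp table" state.

The output of running the SQL through EXPLAIN is as follows: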

id | select_type | table      | type   | possible_keys    | key         | key_len | ref                   | rows    | Extra                                        |
 1 | PRIMARY     | <derived2> | ALL    | NULL             | NULL        | NULL    | NULL                  | 1161898 | Using where; Using filesort                  |
 2 | DERIVED     | <derived4> | system | NULL             | NULL        | NULL    | NULL                  |       1 |                                              |
 2 | DERIVED     | <derived3> | ALL    | NULL             | NULL        | NULL    | NULL                  | 1161898 |                                              |
 4 | DERIVED     | NULL       | NULL   | NULL             | NULL        | NULL    | NULL                  |    NULL | No tables used                               |
 3 | DERIVED     | a          | index  | PRIMARY,ndx_user | ndx_product | 24      | NULL                  | 2603025 | Using index; Using temporary; Using filesort | 
 3 | DERIVED     | b          | ref    | PRIMARY,ndx_user | PRIMARY     | 116     | product_visits.a.user |       1 | Using where; Using index                     |

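While the SQL does exactly what I want, the performance frightens me. Any ideas on how to speed this up?

1 answer:

Answer 0 (score: 0)

I think the simplest way to do this in MySQL is to use variables: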

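SELECT tr.*
FROM (SELECT tr.*,
             -- rn restarts at 1 whenever target_product_id changes and
             -- increments otherwise; @p holds the previous row's target.
             -- This relies on rows arriving in the inner ORDER BY order.
             (@rn := if(@p = target_product_id, @rn + 1,
                        if(@p := target_product_id, 1, 1)
                       )
             ) as rn
      -- Inner query: co-visit count for every ordered pair of distinct
      -- products, sorted so each target's highest counts come first.
      FROM (SELECT a.`product_id` as target_product_id, b.`product_id` as related_product_id,
                   count(*) visits
            FROM `product_visits` a INNER JOIN
                 `product_visits` b
                 ON a.`visitor_id` = b.`visitor_id` AND
                    b.`product_id` != a.`product_id`
            GROUP BY a.`product_id`, b.`product_id`
            ORDER BY a.`product_id`, COUNT(*) desc
           ) tr CROSS JOIN
           -- Initialise both variables once, before any row is processed.
           (SELECT @p := -1, @rn := 0) params
      ) tr
WHERE rn <= 2  -- keep the top 2 related products per target
ORDER BY target_product_id ASC, visits DESC;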
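
For readers on MySQL 8.0 or later, the same per-product limit can probably be expressed with a window function instead of user variables. The following is only a sketch and is not part of the original answer. It assumes MySQL 8.0+ (CTEs and ROW_NUMBER() are not available in 5.x) and the simplified product_visits schema from the question. The names pair_counts, ranked and rn are illustrative. The tie-breaker on related_product_id is also an addition: the queries above do not say which rows win when several related products have the same count (as for target products 4 and 5 in the sample data), and breaking ties by the lower id happens to reproduce the desired result shown in the question.

```sql
-- Sketch only: assumes MySQL 8.0+ and the sample schema from the question.
WITH pair_counts AS (
    -- Co-visit count for every ordered pair of distinct products.
    SELECT a.product_id AS target_product_id,
           b.product_id AS related_product_id,
           COUNT(*)     AS visits
    FROM product_visits a
    INNER JOIN product_visits b
            ON a.visitor_id = b.visitor_id
           AND b.product_id != a.product_id
    GROUP BY a.product_id, b.product_id
),
ranked AS (
    -- Number the related products within each target, highest count first.
    -- related_product_id is an assumed tie-breaker to make the result deterministic.
    SELECT pc.*,
           ROW_NUMBER() OVER (
               PARTITION BY target_product_id
               ORDER BY visits DESC, related_product_id ASC
           ) AS rn
    FROM pair_counts pc
)
SELECT target_product_id, related_product_id, visits
FROM ranked
WHERE rn <= 2  -- top n per target; change 2 to the desired n
ORDER BY target_product_id ASC, visits DESC, related_product_id ASC;
```

This form does not depend on the order in which user-variable assignments are evaluated, which MySQL documents as undefined. Whether it is any faster on a 2.6 million row table is untested here, since the self-join and GROUP BY are unchanged.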