Question

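I'm trying to improve the performance of a query. From EXPLAIN ANALYZE I can see that the query examines far more songs rows than I think is necessary.

There are three tables: artists(artist_id, score), songs(song_id, artist_id) and listened(song_id).

My current query looks like this: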

WITH artists_ranked AS (
    SELECT
      artist_id
      , rank() OVER (ORDER BY score ) rnk
    FROM artists
    ORDER BY rnk ASC
),
    not_listened_songs AS (
      SELECT *
      FROM songs
      WHERE NOT EXISTS(
          SELECT 1
          FROM listened
          WHERE listened.song_id = songs.song_id) -- bad: I go through all songs
  ),
    shuffled_songs AS (
      SELECT *
      FROM artists_ranked
        JOIN not_listened_songs ON not_listened_songs.artist_id = artists_ranked.artist_id
      ORDER BY random() --bad: I shuffle all songs
  )
SELECT DISTINCT ON (artist_id) *
FROM shuffled_songs
LIMIT 1;

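Ideally (at least in my view), the query should follow these steps:

Rank the artists table by score.
Select a batch of the top-ranked artists. It can be one artist or several.
Join the songs table, excluding songs that have already been listened to.
Now we want to pick a random song while giving every artist the same chance: ORDER BY random(), DISTINCT ON (artist_id), LIMIT 1.
If there is such a song, we stop and return it. Otherwise, pick the next batch of artists (the next rank down) and repeat the steps.
- We stop when either a song is returned (most likely after a few iterations) or all artists have been considered.

Thanks.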

Answer 1

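Think about the problem in terms of relational algebra rather than loops.

To get songs that have not been played yet, join artists to songs where the song_id does not exist in listened. Order by score descending so that songs from the highest-scoring artists come first, then shuffle randomly within each score. Limit the result to 1 record.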

SELECT song_id
FROM artists a
JOIN songs s ON s.artist_id = a.artist_id
WHERE NOT EXISTS (SELECT TRUE FROM listened l WHERE l.song_id = s.song_id)
ORDER BY score DESC, RANDOM()
LIMIT 1

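We can give each top-scoring artist an equal chance by considering an equal number of songs from each. Artists can have different numbers of songs. If 2 artists are tied for the top score, one with 100 songs and the other with 1 song, then the query above picks a song from the second artist with probability of about 0.01 (1/101), when it should be 0.5.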
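To make the arithmetic concrete, here is a minimal sketch that counts how many rows each tied artist contributes to a uniform shuffle. The data is made up for illustration (a hypothetical demo_songs set with artist 1 owning 100 songs and artist 2 owning 1 song); it is not taken from the real tables.

```sql
-- Hypothetical data: artist 1 has 100 unplayed songs, artist 2 has 1.
WITH demo_songs (artist_id, song_id) AS (
    SELECT 1, g FROM generate_series(1, 100) AS g
    UNION ALL
    SELECT 2, 101
)
-- With ORDER BY random() LIMIT 1 over all rows, every song is equally
-- likely, so an artist's chance is its share of the rows.
SELECT artist_id
     , count(*) AS songs
     , round(count(*)::numeric / sum(count(*)) OVER (), 4) AS pick_probability
FROM demo_songs
GROUP BY artist_id
ORDER BY artist_id;
```

The shares come out at roughly 0.99 and 0.01, which is why the shuffle has to happen per artist rather than over all songs at once.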

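Assign a random rank to each artist's not-yet-listened songs, then order the final result by score descending and then by that song rank, which effectively interleaves random songs from all artists with the same score: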

SELECT song_id
FROM artists a
NATURAL JOIN songs s 
WHERE NOT EXISTS (
    SELECT TRUE 
    FROM listened l 
    WHERE l.song_id = s.song_id
)
ORDER BY score DESC
       , ROW_NUMBER() OVER (PARTITION BY artist_id ORDER BY RANDOM())
       , FIRST_VALUE(RANDOM()) OVER (PARTITION BY artist_id)

Answer 2

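I tried using a LATERAL JOIN to walk through the artists one by one in score order.

Add artist_id to the listened table to avoid an extra join and so that only one artist's rows are searched at a time.

Add indexes to the tables. Having these indexes is important.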

artists (score, artist_id)
songs (artist_id, song_id)
listened (artist_id, song_id)
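As a rough sketch, the schema changes described above might look like this in PostgreSQL. The index names and the integer type of artist_id are assumptions made for illustration; adjust them to the real schema.

```sql
-- Assumption: artist_id is an integer; match the type of songs.artist_id.
ALTER TABLE listened ADD COLUMN artist_id integer;

-- Backfill the new column from songs.
UPDATE listened l
SET artist_id = s.artist_id
FROM songs s
WHERE s.song_id = l.song_id;

-- Index names below are hypothetical.
CREATE INDEX artists_score_artist_id_idx ON artists (score, artist_id);
CREATE INDEX songs_artist_id_song_id_idx ON songs (artist_id, song_id);
CREATE INDEX listened_artist_id_song_id_idx ON listened (artist_id, song_id);
```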

Query

SELECT
    artists.artist_id
    ,s.song_id
FROM
    artists
    INNER JOIN LATERAL
    (
        SELECT songs.song_id
        FROM songs
        WHERE
            songs.artist_id = artists.artist_id
            AND NOT EXISTS
            (
                SELECT 1
                FROM listened
                WHERE
                    listened.artist_id = songs.artist_id
                    -- limit listened songs to one artist
                    AND listened.song_id = songs.song_id
            )
        ORDER BY random()
        -- shuffle only songs of one artist
        LIMIT 1
    ) AS s ON true
ORDER BY artists.score ASC, random()
-- if there are several artists with the same score
-- pick one random artist among them
LIMIT 1;

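The query picks the top artist, shuffles his songs, picks the next top artist, shuffles his songs, and so on.

This query should run fast when the top artists still have songs left to play. It becomes slower and slower as it has to work through the list of top artists down to lower-ranked rows.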
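One way to check that the per-artist lookup stays cheap is to run the inner subquery on its own under EXPLAIN for a single artist. This is only a sketch: the literal artist_id = 1 is a hypothetical value, and the plan you see depends on your data and on whether the indexes listed above exist.

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT songs.song_id
FROM songs
WHERE
    songs.artist_id = 1 -- hypothetical artist_id
    AND NOT EXISTS
    (
        SELECT 1
        FROM listened
        WHERE
            listened.artist_id = songs.artist_id
            AND listened.song_id = songs.song_id
    )
ORDER BY random()
LIMIT 1;
```

If the plan shows sequential scans over songs or listened, the indexes are probably missing or not being used.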

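If score is not unique, then ORDER BY score LIMIT 1 returns one "random" row among those with the top score. Which artist is picked is not defined. It is not strictly random, just undefined: the result may or may not be the same on each run of the query. To make it truly random, add random() explicitly.

With this addition, the query chooses among several artists with the same top score with equal probability, regardless of how many songs each of them has.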
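The difference is easiest to see on the artists table alone. A minimal sketch, using only the columns named in the question:

```sql
-- Tie-breaking is left to the planner: any of the tied artists may come
-- back, and it may or may not be the same one on every run.
SELECT artist_id
FROM artists
ORDER BY score ASC
LIMIT 1;

-- Ties are broken explicitly: each tied artist is equally likely.
SELECT artist_id
FROM artists
ORDER BY score ASC, random()
LIMIT 1;
```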

You can extend the query to consider a "batch" of the top N artists, not just one top artist at a time:

WITH
CTE
AS
(
    SELECT
        artists.artist_id
        ,s.song_id
    FROM
        artists
        INNER JOIN LATERAL
        (
            SELECT songs.song_id
            FROM songs
            WHERE
                songs.artist_id = artists.artist_id
                AND NOT EXISTS
                (
                    SELECT 1
                    FROM listened
                    WHERE
                        listened.artist_id = songs.artist_id
                        -- limit listened songs to one artist
                        AND listened.song_id = songs.song_id
                )
            ORDER BY random()
            -- shuffle only songs of one artist
            LIMIT 1
        ) AS s ON true
    ORDER BY artists.score ASC
    LIMIT 5 -- pick top N artists, N = 5
)
SELECT
    artist_id
    ,song_id
FROM CTE
ORDER BY random() -- shuffle top N artists
LIMIT 1 -- pick one random artist out of top N

Get a batch of records per rank, then join, then use LIMIT 1 in Postgres
