Get the first record per foreign key from a join table without repeating primary keys

Time: 2013-01-26 18:04:15

Tags: mysql sql sql-server tsql

I have the following table structure:

Tags:
Tag_ID | Name
1      | Tag1
2      | Tag2
3      | Tag3
4      | Tag4
5      | Tag5
6      | Tag6

Posts:
Post_ID | Title | Body
1       | Post1 | Post1
2       | Post2 | Post2
3       | Post3 | Post3
4       | Post4 | Post4
5       | Post5 | Post5
6       | Post6 | Post6
7       | Post7 | Post7
8       | Post8 | Post8
9       | Post9 | Post9
10      | Post10| Post10

TagsPosts:
Tag_ID | Post_ID
1      | 1
1      | 2
1      | 3
1      | 4
1      | 5
1      | 10
1      | 1
2      | 1
2      | 2
2      | 6
2      | 7
3      | 4
3      | 8
3      | 9
4      | 7
5      | 1
5      | 2
5      | 3
5      | 4
5      | 5
5      | 6
5      | 7
6      | 2

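What I need the query to return is the top 3 Posts of the most common Tag, plus the top 1 Post for each of the remaining Tags, without returning any duplicate Posts.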

Desired Output:
Tag_ID | Post_ID
5      | 1
5      | 2
5      | 3
1      | 10
2      | 6
3      | 9
4      | 7

So far I have been able to determine the top 3 Posts of the most common Tag using the following:

SELECT TOP(3) t.Tag_ID, p.Post_ID FROM Tags as t
INNER JOIN TagsPosts as tp ON t.Tag_ID = tp.Tag_ID
INNER JOIN Posts as p ON tp.Post_ID = p.Post_ID
WHERE t.Tag_ID IN (
    SELECT TOP(1) Tag_ID FROM TagsPosts GROUP BY Tag_ID ORDER BY COUNT(Tag_ID) DESC)
-- Without an ORDER BY, which three rows TOP(3) returns is not guaranteed
ORDER BY p.Post_ID

Result:
Tag_ID | Post_ID
5      | 1
5      | 2
5      | 3

I have also determined the top 1 Post for each of the remaining Tags using the following:

SELECT t.Tag_ID, p.Post_ID FROM Tags as t
INNER JOIN (
    -- For every tag except the most common one, take the highest Post_ID
    -- that is not already among the most common tag's three posts.
    SELECT t.Tag_ID, MAX(p.Post_ID) as Post_ID FROM Tags as t
    INNER JOIN TagsPosts as tp ON t.Tag_ID = tp.Tag_ID
    INNER JOIN Posts as p ON tp.Post_ID = p.Post_ID
    WHERE t.Tag_ID NOT IN (
        SELECT TOP(1) Tag_ID FROM TagsPosts GROUP BY Tag_ID ORDER BY COUNT(Tag_ID) DESC)
    AND p.Post_ID NOT IN (
        SELECT TOP(3) p.Post_ID FROM Tags as t
        INNER JOIN TagsPosts as tp ON t.Tag_ID = tp.Tag_ID
        INNER JOIN Posts as p ON tp.Post_ID = p.Post_ID
        WHERE t.Tag_ID IN (
            SELECT TOP(1) Tag_ID FROM TagsPosts GROUP BY Tag_ID ORDER BY COUNT(Tag_ID) DESC)
        -- ORDER BY makes the TOP(3) deterministic
        ORDER BY p.Post_ID)
    GROUP BY t.Tag_ID) as s ON t.Tag_ID = s.Tag_ID
INNER JOIN Posts as p ON s.Post_ID = p.Post_ID

Result:
Tag_ID | Post_ID
1      | 10
2      | 7
3      | 9
4      | 7

This is almost there, but as you can see it returns a duplicate Post: Post_ID 7 appears for both Tag 2 and Tag 4.
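
The duplicate appears because each Tag picks its MAX(Post_ID) independently of the others. As a rough illustration of the intended selection logic, here is a minimal sketch in Python rather than SQL. It is only a greedy heuristic under stated assumptions, not a general solution: the function name `pick_posts` is hypothetical, "most common" is assumed to mean the Tag with the most distinct Posts, and Tags with the fewest unused Posts are served first so that a Tag with a single candidate (such as Tag 4) is not starved.

```python
from collections import defaultdict

# (Tag_ID, Post_ID) rows copied from the TagsPosts table in the question
TAGS_POSTS = [
    (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 10), (1, 1),
    (2, 1), (2, 2), (2, 6), (2, 7),
    (3, 4), (3, 8), (3, 9),
    (4, 7),
    (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6), (5, 7),
    (6, 2),
]

def pick_posts(tag_posts, top_n=3):
    """Hypothetical helper: top_n posts of the most common tag, then one unused post per other tag."""
    posts_by_tag = defaultdict(set)  # a set ignores repeated rows such as the second (1, 1)
    for tag_id, post_id in tag_posts:
        posts_by_tag[tag_id].add(post_id)

    # Assumption: "most common" = most distinct posts; ties go to the lower Tag_ID
    top_tag = max(posts_by_tag, key=lambda t: (len(posts_by_tag[t]), -t))
    result = [(top_tag, p) for p in sorted(posts_by_tag[top_tag])[:top_n]]
    used = {p for _, p in result}

    # Greedy heuristic: serve the tags with the fewest unused posts first
    remaining = [t for t in posts_by_tag if t != top_tag]
    remaining.sort(key=lambda t: (len(posts_by_tag[t] - used), t))

    picks = {}
    for tag in remaining:
        free = posts_by_tag[tag] - used
        if free:  # a tag whose posts are all taken gets no row
            picks[tag] = max(free)
            used.add(picks[tag])

    result.extend(sorted(picks.items()))
    return result

print(pick_posts(TAGS_POSTS))
```

On the sample data this reproduces the desired output above, and Tag 6 gets no row because its only Post (2) is already taken by Tag 5. A greedy ordering like this can still miss a valid assignment on other data; the general problem is a bipartite matching between Tags and Posts.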

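By the way, I'm testing with SQL Server 2008 Express because I'm not familiar with MySQL, but I've been asked to come up with a SQL query that can be applied to a MySQL database. I figure that if I get the basic query working in T-SQL, converting it to whatever SQL dialect MySQL uses will be fairly straightforward.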

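1 Answer:

Answer 0: (score: 0)

I would use a windowed function, put it in a CTE, and then reference it in a predicate. Something like this (using a simplified version of the data that can be run from SSMS). You listed SQL Server but not the version. I believe windowed functions work on SQL Server 2005 and later, but I'm not certain.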

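-- Table variables holding a simplified copy of the data.
-- Note: the multi-row VALUES syntax used below requires SQL Server 2008 or later.
declare @Tag table ( tagid int identity, name varchar(8));

insert into @Tag values ('Tag1'),('Tag2'),('Tag3'),('Tag4'),('Tag5'),('Tag6');

declare @Posts table (postid int identity, tagid int, postbody varchar(32));

-- postid is an identity column, so each row supplies only (tagid, postbody)
insert into @Posts values (1,'Blah'),(1, 'Blahblah'),(2, 'Blahblah'),(3, 'Blahbodyblah'),(4, 'Blahblahblah'),(4, 'Blahbodyblah'),(4, 'Blah'),(5, 'Blah'),(5, 'Blahblah'),(6, 'Blahblah');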

-- use a CTE
with a as 
    (
    select 
        p.postbody
    ,   count(t.tagid) as TimesTagged
        /* You stated you wanted posts returned based on how often they occur.  I am numbering the 
        COUNTS OF TAGIDs in descending order (greatest first), starting from one.  If you can have ties and 
        want to keep them, I would consider using DENSE_RANK.  You would have to insert more values, so that a 
        third occurrence becomes a TIE, to see how RANK, DENSE_RANK, and ROW_NUMBER differ.  They all have 
        their purposes, but you should know what you want before deciding which one to use.
        */
    ,   row_number() over(order by count(t.tagid) desc) as PositionOfCountsTaggedByGreatestOrderFirst
    ,   Rank() over(order by count(t.tagid) desc) as PositionOfCountsTaggedByGreatestOrderFirst_Ranking
    ,   Dense_Rank() over(order by count(t.tagid) desc) as PositionOfCountsTaggedByGreatestOrderFirst_DenseRanking
    from @Tag t 
        join @Posts p on t.tagid = p.tagid
    group by p.postbody
    )
select *
from a
-- I only filter on ROW_NUMBER here; you can filter on one of the other ranking columns above instead if you wish.
where PositionOfCountsTaggedByGreatestOrderFirst <= 3


/*
You stated you only want the top three counts.
Windowed functions are better than TOP, IMHO, because you can express 'in' lists, medians, and other selections
explicitly instead of repeating nested selects.  The only downside is that you cannot put a predicate
on a windowed function directly.  You must compute it first and then, in a nested select, a CTE (as shown),
a table variable, a temp table, etc., define the predicate on it.
*/
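
To see how the three ranking functions differ on a tie, and how the predicate has to sit outside the CTE that computes them, here is a small self-contained sketch. It runs the same pattern from Python against an in-memory SQLite database (SQLite 3.25 or later is needed for window functions) instead of SQL Server. The sample rows, the `top_posts` helper, and the column aliases `rn`, `rnk`, and `dense_rnk` are all made up for illustration: A is tagged three times, B and C tie at two, and D is tagged once.

```python
import sqlite3

# Hypothetical sample data as (tagid, postbody): A x3, B x2, C x2 (a tie), D x1
SAMPLE = [(1, "A"), (2, "A"), (3, "A"), (1, "B"), (2, "B"), (1, "C"), (2, "C"), (1, "D")]

RANKING_COLUMNS = {"rn", "rnk", "dense_rnk"}

def top_posts(rows, column, limit):
    """Return rows whose ranking column (rn, rnk or dense_rnk) is <= limit."""
    if column not in RANKING_COLUMNS:
        raise ValueError(f"unknown ranking column: {column}")
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE posts (postid INTEGER PRIMARY KEY, tagid INTEGER, postbody TEXT)"
        )
        conn.executemany("INSERT INTO posts (tagid, postbody) VALUES (?, ?)", rows)
        query = f"""
            WITH counts AS (
                SELECT postbody, COUNT(tagid) AS times_tagged
                FROM posts
                GROUP BY postbody
            ),
            ranked AS (
                SELECT postbody,
                       times_tagged,
                       -- postbody is a tie-breaker so ROW_NUMBER is deterministic
                       ROW_NUMBER() OVER (ORDER BY times_tagged DESC, postbody) AS rn,
                       RANK()       OVER (ORDER BY times_tagged DESC) AS rnk,
                       DENSE_RANK() OVER (ORDER BY times_tagged DESC) AS dense_rnk
                FROM counts
            )
            -- The predicate goes here, outside the CTE that computes the ranking
            SELECT postbody, times_tagged, rn, rnk, dense_rnk
            FROM ranked
            WHERE {column} <= ?
            ORDER BY rn
        """
        return conn.execute(query, (limit,)).fetchall()
    finally:
        conn.close()

for col in ("rn", "rnk", "dense_rnk"):
    print(col, [row[0] for row in top_posts(SAMPLE, col, 2)])
```

With this data, filtering on `rn <= 2` keeps only A and B, while `rnk <= 2` and `dense_rnk <= 2` both keep A, B, and C because the tied rows share position 2. The difference between the last two shows up one step later: D gets rank 4 but dense rank 3.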