Delete all but the newest N table rows in each category (multiple categories concatenated in a varchar)

Time: 2014-01-22 10:54:55

Tags: sql sql-server sql-server-2008

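I have a SQL table of news articles, and each article can appear in several categories. Unfortunately, the categories are stored as text values concatenated into a single varchar on each row.

I want to keep the newest 5 news articles in each category and delete the older ones. I don't think this can be done without procedural code (a SQL loop/cursor, or an external program that knows all the possible category names and calls the SQL repeatedly).

Here is my test data, without the news article titles/content. I believe the code first needs to remove the unwanted category strings, and then delete every row that has had all of its categories removed.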

declare @News table(ArticleId INTEGER NOT NULL, DateAdded SMALLDATETIME NOT NULL, Categories VARCHAR(250) NOT NULL)  
insert into @News values (11, '2014-01-11', 'SPORT~CELEBS~')  
insert into @News values (10, '2014-01-10', 'SPORT~CELEBS~POLITICS~')  
insert into @News values (9, '2014-01-09', 'SPORT~CELEBS~')  
insert into @News values (8, '2014-01-08', 'SPORT~NATURE~')  
insert into @News values (7, '2014-01-07', 'SPORT~CELEBS~')  
insert into @News values (6, '2014-01-06', 'SPORT~CELEBS~POLITICS~') --ought to have SPORT label removed  
insert into @News values (5, '2014-01-05', 'POLITICS~')  
insert into @News values (4, '2014-01-04', 'POLITICS~')  
insert into @News values (3, '2014-01-03', 'POLITICS~')  
insert into @News values (2, '2014-01-02', 'POLITICS~') --ought to get deleted  
insert into @News values (1, '2014-01-01', 'CELEBS~') --ought to get deleted

--magic happens  

delete from @News where Categories = ''  
select * from @News order by DateAdded desc

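If the only solution is to use WHILE or a CURSOR, then I will wrap the SQL in a stored procedure and call it repeatedly with the value 'SPORT~', then 'CELEBS~', then 'POLITICS~', and so on.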
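For illustration, one pass of that per-category fallback might look roughly like the sketch below. It is only a sketch under stated assumptions: `@Category` and `@Keep` are hypothetical stand-ins for the parameters of the wrapping stored procedure, the test data is repeated so the batch runs on its own, and category names are assumed not to contain LIKE wildcards such as `%`, `_` or `[`.

```sql
-- Sketch only: a single pass for one category, using the question's test data.
declare @News table (ArticleId INT NOT NULL, DateAdded SMALLDATETIME NOT NULL,
                     Categories VARCHAR(250) NOT NULL);
insert into @News values
    (11, '2014-01-11', 'SPORT~CELEBS~'),
    (10, '2014-01-10', 'SPORT~CELEBS~POLITICS~'),
    ( 9, '2014-01-09', 'SPORT~CELEBS~'),
    ( 8, '2014-01-08', 'SPORT~NATURE~'),
    ( 7, '2014-01-07', 'SPORT~CELEBS~'),
    ( 6, '2014-01-06', 'SPORT~CELEBS~POLITICS~'),
    ( 5, '2014-01-05', 'POLITICS~'),
    ( 4, '2014-01-04', 'POLITICS~'),
    ( 3, '2014-01-03', 'POLITICS~'),
    ( 2, '2014-01-02', 'POLITICS~'),
    ( 1, '2014-01-01', 'CELEBS~');

-- Hypothetical parameters of the wrapping procedure (names are assumptions).
declare @Category varchar(50) = 'SPORT~';
declare @Keep int = 5;

-- Number the articles in this category from newest to oldest, then strip the
-- label from everything past the newest @Keep. Prefixing '~' on both sides
-- keeps the match anchored on a whole category name.
;with InCategory as (
    select Categories,
           row_number() over (order by DateAdded desc, ArticleId desc) as rn
    from @News
    where '~' + Categories like '%~' + @Category + '%'
)
update InCategory
set Categories = stuff(replace('~' + Categories, '~' + @Category, '~'), 1, 1, '')
where rn > @Keep;

-- Rows that have lost every label can now go.
delete from @News where Categories = '';
select * from @News order by DateAdded desc;
```

Run for 'SPORT~' alone, the intent is that only article 6 changes, losing its SPORT label. The same pass would then have to be repeated for every other category value.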

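1 Answer:

Answer 0: (score: 0)

I have found a partial (and very inelegant) solution. The approach is to recreate the "glue" table that would exist if this were a sane database (although a proper glue table would have two FKs).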

--create list of all possible category values (get first category from every row, then second, then third, etc)
declare @Category table (SingleCategory VARCHAR(50))
insert into @Category
select distinct LEFT(SingleCategory, charindex('~', SingleCategory))
from (
    select categories as SingleCategory from @News
    union
    select SUBSTRING(categories, charindex('~', categories)+1, 100) from @News where Categories like '%~%~'
    union
    select SUBSTRING(categories, charindex('~', categories, charindex('~', categories)+1)+1, 100) from @News where Categories like '%~%~%~'
    --repeat if 4 or 5 occurrences are possible, etc
) sq

--create a 'glue' table
declare @Glue table(ArticleId INT NOT NULL, DateAdded SMALLDATETIME NOT NULL, Category VARCHAR(50) NOT NULL)
insert into @Glue 
select articleid, dateadded, SingleCategory
from @News n
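-- anchor both sides on '~' so that 'SPORT~' cannot also match a hypothetical longer name such as 'MOTORSPORT~'
inner join @Category c on '~' + n.categories LIKE '%~' + c.SingleCategory + '%'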

--use the glue table to identify the articles we do want, and delete all the others
delete from @News where ArticleId not in (
    SELECT articleid
    FROM (
        SELECT articleid, Category, 
        RANK() OVER(PARTITION BY Category ORDER BY dateadded DESC) AS RankThem
        FROM @Glue
    ) sq
    WHERE RankThem <= 5
)

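The DELETE against the glue table gets rid of the two rows we don't want, but we still end up with 6 articles in the SPORT category: article 6 is kept because it ranks in the newest 5 for CELEBS and POLITICS, and it still carries its SPORT label. So it is not a perfect solution. Is there a better way?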
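One possible set-based direction, offered as a sketch rather than a verified answer: split every Categories string into one row per label, rank the articles within each label, and then rebuild each Categories string from only the labels that rank in the newest 5. The `@Split` table, its `Pos` column and the `Pieces` and `Ranked` CTE names are assumptions introduced for this illustration. The sketch also assumes every category name ends in '~' and uses ROW_NUMBER instead of RANK, so that ties on DateAdded cannot let more than 5 rows through.

```sql
-- Sketch only: assumes every category name ends in '~', as in the test data.
declare @News table (ArticleId INT NOT NULL, DateAdded SMALLDATETIME NOT NULL,
                     Categories VARCHAR(250) NOT NULL);
insert into @News values
    (11, '2014-01-11', 'SPORT~CELEBS~'),
    (10, '2014-01-10', 'SPORT~CELEBS~POLITICS~'),
    ( 9, '2014-01-09', 'SPORT~CELEBS~'),
    ( 8, '2014-01-08', 'SPORT~NATURE~'),
    ( 7, '2014-01-07', 'SPORT~CELEBS~'),
    ( 6, '2014-01-06', 'SPORT~CELEBS~POLITICS~'),
    ( 5, '2014-01-05', 'POLITICS~'),
    ( 4, '2014-01-04', 'POLITICS~'),
    ( 3, '2014-01-03', 'POLITICS~'),
    ( 2, '2014-01-02', 'POLITICS~'),
    ( 1, '2014-01-01', 'CELEBS~');

-- Hypothetical glue table: one row per (article, label); Pos is the label's
-- position inside the original string, kept so the order can be restored.
declare @Split table (ArticleId INT NOT NULL, DateAdded SMALLDATETIME NOT NULL,
                      Category VARCHAR(50) NOT NULL, Pos INT NOT NULL);

-- The recursive CTE peels one 'NAME~' label off the front per level, so the
-- number of labels per row does not have to be known in advance.
;with Pieces as (
    select ArticleId, DateAdded, 1 as Pos,
           cast(left(Categories, charindex('~', Categories)) as varchar(50)) as Category,
           cast(substring(Categories, charindex('~', Categories) + 1, 250) as varchar(250)) as Rest
    from @News
    where charindex('~', Categories) > 0
    union all
    select ArticleId, DateAdded, Pos + 1,
           cast(left(Rest, charindex('~', Rest)) as varchar(50)),
           cast(substring(Rest, charindex('~', Rest) + 1, 250) as varchar(250))
    from Pieces
    where charindex('~', Rest) > 0
)
insert into @Split (ArticleId, DateAdded, Category, Pos)
select ArticleId, DateAdded, Category, Pos from Pieces;

-- Rebuild each Categories string from the labels that rank in the newest 5.
;with Ranked as (
    select ArticleId, Category, Pos,
           row_number() over (partition by Category
                              order by DateAdded desc, ArticleId desc) as rn
    from @Split
)
update n
set Categories = isnull((
        select '' + r.Category          -- unnamed column: no XML element tags
        from Ranked r
        where r.ArticleId = n.ArticleId and r.rn <= 5
        order by r.Pos
        for xml path('')
    ), '')
from @News n;

-- Articles left with no labels are no longer in any category's newest 5.
delete from @News where Categories = '';
select * from @News order by DateAdded desc;
```

With the question's test data the intent is that articles 1 and 2 are deleted and article 6 is reduced to 'CELEBS~POLITICS~'. One caveat: FOR XML PATH escapes characters such as `&`, `<` and `>`, so category names containing them would need extra handling.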