Natural (human alphanumeric) sort in Microsoft SQL Server

Date: 2018-03-20 00:33:51

Tags: sql sql-server sorting natural-sort

Thank you for taking the time to read all of this, it's a lot! Thanks to all the enthusiasts!

How do you do a natural sort?

That is, order a set of alphanumeric data so that it displays as:

Season 1, Season 2, Season 10, Season 20

rather than

Season 1, Season 10, Season 2, Season 20

As the case study I'm using a very practical example, TV seasons, in a very practical format.
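As a point of comparison outside SQL Server, the two orderings above can be reproduced with a short Python sketch. The function name `natural_key` and the case-insensitive comparison are assumptions made for illustration; they are not part of the question.

```python
import re

def natural_key(text):
    """Split text into digit and non-digit runs; digit runs compare as integers."""
    parts = re.split(r"([0-9]+)", text)
    # With a capturing group, re.split keeps the digit runs at the odd indexes.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

titles = ["Season 10", "Season 2", "Season 20", "Season 1"]

plain_order = sorted(titles)                      # character-by-character order
natural_order = sorted(titles, key=natural_key)   # numbers compared by value

print(plain_order)
print(natural_order)
```

Because every key alternates text and number in the same positions, two keys never compare a string against an integer.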

I'm hoping to accomplish the following:

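  1. Share my working solution
  2. Ask for your help in working out how to shorten my solution (or find a better one)
  3. Can you solve criterion 7 below?

I spent about 2 hours researching online and another 3 hours building this solution, drawing on several references.

Some of the solutions found on SO and other sites only work for 90% of cases. Most, if not all, of them fail when the text contains more than one numeric value, or raise a SQL error when the text contains no number at all.

I have created this SQLFiddle link to play around with (it includes all of the code below).

Here is the create statement: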

    create table tvseason
    (
        title varchar(100)
    );
    
    insert into tvseason (title)
    values ('100 Season 03'), ('100 Season 1'),
           ('100 Season 10'), ('100 Season 2'),
           ('100 Season 4'), ('Show Season 1 (2008)'),
           ('Show Season 2 (2008)'), ('Show Season 10 (2008)'),
           ('Another Season 01'), ('Another Season 02'),
           ('Another 1st Anniversary Season 01'),
           ('Another 2nd Anniversary Season 01'),
           ('Another 10th Anniversary Season 01'),
           ('Some Show Another No Season Number'),
           ('Some Show No Season Number'),
           ('Show 2 Season 1'),
           ('Some Show With Season Number 1'),
           ('Some Show With Season Number 2'),
           ('Some Show With Season Number 10');
    

Here is my working solution (the only thing it cannot solve is criterion #7 below):

    select 
        title, "index", titleLeft,
        convert(int, coalesce(nullif(titleRightTrim2, ''), titleRight)) titleRight
    from
        (select 
             title, "index", titleLeft, titleRight, titleRightTrim1,
             case 
                when PATINDEX('%[^0-9]%', titleRightTrim2) = 0 
                   then titleRightTrim2
                   else left(titleRightTrim2, PATINDEX('%[^0-9]%', titleRightTrim2) - 1)
             end as titleRightTrim2
         from
             (select
                  title, 
                  len(title) - PATINDEX('%[0-9] %', reverse(title)) 'index',
                  left(title, len(title) - PATINDEX('%[0-9] %', reverse(title))) titleLeft,
                  ltrim(right(title, PATINDEX('%[0-9] %', reverse(title)))) titleRight,
                  ltrim(right(title, PATINDEX('%[0-9] %', reverse(title)))) titleRightTrim1,
                  left(ltrim(right(title, PATINDEX('%[0-9] %', reverse(title)))), PATINDEX('% %', ltrim(right(title, PATINDEX('%[0-9] %', reverse(title)))))) titleRightTrim2
              from
                  tvseason) x) y
    order by 
        titleLeft, titleRight
    

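Criteria to consider:

1. The text contains no numbers
2. The text contains numbers at the beginning and at the end
3. The text contains only numbers
4. The text contains numbers only at the end
5. The text may contain (YYYY) at the end
6. The text may end with a single-digit or a double-digit number (e.g. 1 or 01)
7. Optional: any combination of the above, plus numbers in the middle of the text

Here is the output: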

      title
      100 Season 1
      100 Season 2
      100 Season 03
      100 Season 4
      100 Season 10
      **Case 7 here**
      Another 10th Anniversary Season 01
      Another 1st Anniversary Season 01
      Another 2nd Anniversary Season 01
      Another Season 01
      Another Season 02
      Show (2008) Season 1
      Show (2008) Season 2
      Show 2 The 75th Anniversary Season 1
      Show Season 1 (2008)
      Show Season 2 (2008)
      Show Season 10 (2008)
      Some Show Another No Season Number
      Some Show No Season Number
      Some Show With Season Number 1
      Some Show With Season Number 2
      Some Show With Season Number 10
      

4 answers:

Answer 0 (score: 3)

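I think this will do the trick... I simply detect the transitions from non-numeric to numeric characters (and back). I haven't done any large-scale testing, but it should be reasonably fast.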

SET QUOTED_IDENTIFIER ON;
GO
SET ANSI_NULLS ON;
GO

ALTER FUNCTION dbo.tfn_SplitForSort
/* ===================================================================
11/11/2018 JL, Created: Comments    
=================================================================== */
--===== Define I/O parameters
(
    @string VARCHAR(8000)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN 
    WITH 
        cte_n1 (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)), 
        cte_n2 (n) AS (SELECT 1 FROM cte_n1 a CROSS JOIN cte_n1 b),
        cte_Tally (n) AS (
            SELECT TOP (LEN(@string))
                ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
            FROM
                cte_n2 a CROSS JOIN cte_n2 b
            ),
        cte_split_string AS (
            SELECT 
                col_num = ROW_NUMBER() OVER (ORDER BY t.n) + CASE WHEN LEFT(@string, 1) LIKE '[0-9]' THEN 0 ELSE 1 END,
                string_part = SUBSTRING(@string, t.n, LEAD(t.n, 1, 8000) OVER (ORDER BY t.n) - t.n)
            FROM
                cte_Tally t
                CROSS APPLY ( VALUES (SUBSTRING(@string, t.n, 2)) ) s (str2)
            WHERE 
                t.n = 1
                OR SUBSTRING(@string, t.n - 1, 2) LIKE '[0-9][^0-9]'
                OR SUBSTRING(@string, t.n - 1, 2) LIKE '[^0-9][0-9]'
            )

    SELECT 
        so_01 = ISNULL(MAX(CASE WHEN ss.col_num = 1 THEN CONVERT(FLOAT, ss.string_part) END), 99999999),
        so_02 = MAX(CASE WHEN ss.col_num = 2 THEN ss.string_part END),
        so_03 = MAX(CASE WHEN ss.col_num = 3 THEN CONVERT(FLOAT, ss.string_part) END),
        so_04 = MAX(CASE WHEN ss.col_num = 4 THEN ss.string_part END),
        so_05 = MAX(CASE WHEN ss.col_num = 5 THEN CONVERT(FLOAT, ss.string_part) END),
        so_06 = MAX(CASE WHEN ss.col_num = 6 THEN ss.string_part END),
        so_07 = MAX(CASE WHEN ss.col_num = 7 THEN CONVERT(FLOAT, ss.string_part) END),
        so_08 = MAX(CASE WHEN ss.col_num = 8 THEN ss.string_part END),
        so_09 = MAX(CASE WHEN ss.col_num = 9 THEN CONVERT(FLOAT, ss.string_part) END),
        so_10 = MAX(CASE WHEN ss.col_num = 10 THEN ss.string_part END)
    FROM
        cte_split_string ss;
GO

The function in use...

SELECT 
    ts.*
FROM
    #tvseason ts
    CROSS APPLY dbo.tfn_SplitForSort (ts.title) sfs
ORDER BY
    sfs.so_01,
    sfs.so_02,
    sfs.so_03,
    sfs.so_04,
    sfs.so_05,
    sfs.so_06,
    sfs.so_07,
    sfs.so_08,
    sfs.so_09,
    sfs.so_10;

Results:

id          title
----------- ------------------------------------------
2           100 Season 1
4           100 Season 2
1           100 Season 03
5           100 Season 4
3           100 Season 10
11          Another 1st Anniversary Season 01
12          Another 2nd Anniversary Season 01
13          Another 10th Anniversary Season 01
9           Another Season 01
10          Another Season 02
16          Show 2 Season 1
6           Show Season 1 (2008)
7           Show Season 2 (2008)
8           Show Season 10 (2008)
14          Some Show Another No Season Number
15          Some Show No Season Number
17          Some Show With Season Number 1
18          Some Show With Season Number 2
19          Some Show With Season Number 10
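The splitting idea in this answer (cut the string wherever it switches between digits and non-digits, then sort on alternating number and text slots) can be sketched in Python for experimentation. This is only an approximation of the T-SQL function: the names `split_for_sort` and `NO_LEADING_NUMBER` are illustrative, text is compared in lowercase on the assumption of a case-insensitive collation, and missing slots are ordered first, as NULLs are in an ascending SQL Server sort.

```python
import re

NO_LEADING_NUMBER = 99999999.0  # same sentinel the answer uses for so_01

def split_for_sort(title, slots=10):
    """Build a sort key of alternating number/text slots (so_01..so_10 in the answer)."""
    runs = re.findall(r"[0-9]+|[^0-9]+", title)
    # Odd-numbered slots hold numbers, so a title that starts with text skips slot 1.
    if runs and not ("0" <= runs[0][0] <= "9"):
        runs.insert(0, None)
    runs = (runs + [None] * slots)[:slots]
    key = []
    for position, run in enumerate(runs):
        if position % 2 == 0:  # numeric slot
            if run is None:
                # Missing numbers sort first, except slot 1, which gets the sentinel.
                key.append((1, NO_LEADING_NUMBER) if position == 0 else (0, 0.0))
            else:
                key.append((1, float(run)))
        else:  # text slot
            key.append((0, "") if run is None else (1, run.lower()))
    return tuple(key)

titles = [
    "Show Season 10 (2008)", "100 Season 10", "Another 10th Anniversary Season 01",
    "Show Season 2 (2008)", "100 Season 2", "Another 2nd Anniversary Season 01",
]
ordered = sorted(titles, key=split_for_sort)
print(ordered)
```

As in the T-SQL version, anything beyond the tenth slot is ignored, so titles that differ only after five number/text pairs compare as equal.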

Answer 1 (score: 2)

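Personally, I would try to avoid doing complex string manipulation in SQL. I would probably dump the data to a text file, process it with a regular expression in something like C# or Python, and then write the result back to the DB in a separate column. SQL is very poor at string manipulation.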
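A minimal sketch of that out-of-database route in Python, assuming the value of interest is the number that directly follows the word "Season". The names `SEASON_PATTERN` and `season_number` are made up for this example, and reading from and writing back to the database is left out.

```python
import re

# "Season" followed by whitespace and a run of digits, in any letter case.
SEASON_PATTERN = re.compile(r"season\s+([0-9]+)", re.IGNORECASE)

def season_number(title):
    """Return the season number as an int, or None when the title has none."""
    match = SEASON_PATTERN.search(title)
    return int(match.group(1)) if match else None

rows = ["Show Season 10 (2008)", "Some Show No Season Number", "Show Season 2 (2008)"]

# (title, season) pairs that could be written back to a separate column.
with_season = [(title, season_number(title)) for title in rows]

# Titles without a season sort first, as NULLs do in an ascending SQL Server sort.
ordered = sorted(rows, key=lambda t: (season_number(t) is not None, season_number(t) or 0))
print(ordered)
```

Note that a title such as "Some Show With Season Number 1" yields None here, because the digit does not directly follow "Season".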

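However, here's my stab at a SQL approach. The idea is basically to first deal with any rows that don't contain the string Season [number], which handles the case where there is no season to parse. I chose to include them as null, but you could just as easily omit them in the where clause, or give them some default value. I use the stuff() function to cut off everything up to and including the string Season, so the rest is easier to work with.

Now we have a string that starts with the season number and possibly ends with garbage. I use a case statement to check whether there is any garbage (anything non-numeric); if there is, I take the leftmost numeric characters and throw away the rest. If the string is all digits to begin with, I leave it as is.

Finally, cast it to an int and sort by it.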

if object_id('tempdb.dbo.#titles') is not null drop table #titles
create table #titles (Title varchar(100))
insert into #titles (Title)
select title = '100 Season 1'
union all select '100 Season 2'
union all select '100 Season 03'
union all select '100 Season 4'
union all select '100 Season 10'
union all select 'Another 10th Anniversary Season 01'
union all select 'Another 1st Anniversary Season 01'
union all select 'Another 2nd Anniversary Season 01'
union all select 'Another Season 01'
union all select 'Another Season 02'
union all select 'Show (2008) Season 1'
union all select 'Show (2008) Season 2'
union all select 'Show 2 The 75th Anniversary Season 1'
union all select 'Show Season 1 (2008)'
union all select 'Show Season 2 (2008)'
union all select 'Show Season 10 (2008)'
union all select 'Some Show Another No Season Number'
union all select 'Some Show No Season Number'
union all select 'Some Show With Season Number 1'
union all select 'Some Show With Season Number 2'
union all select 'Some Show With Season Number 10'

;with src as
(
    select 
        Title, 
        Trimmed = case when Title like '%Season [0-9]%' 
                       then stuff(title, 1, patindex('%season [0-9]%', title) + 6, '')
                       else null
                  end
    from #titles
)
select 
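    -- Keep only the leading digits: patindex returns the position of the first non-digit, so stop one character before it
    Season = cast(case when Trimmed like '%[^0-9]%' then left(Trimmed, patindex('%[^0-9]%', Trimmed) - 1)
         else Trimmed
    end as int),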
    Title
from src
order by Season 

Answer 2 (score: 2)

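The requirements in this question are complex, so they can't be met with a simple query. My solution is as follows. First, I create the sample data that the query will use.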

CREATE TABLE #TVSEASON (TITLE VARCHAR(100));
INSERT INTO #TVSEASON (TITLE) VALUES 
('100'),
('100 SEASON 03'),
('100 SEASON 1'),
('100 SEASON 10'),
('100 SEASON 2'),
('100 SEASON 4'),
('SHOW (2008) SEASON 1'),
('SHOW (2008) SEASON 2'),
('SHOW SEASON 1 (2008)'),
('SHOW SEASON 2 (2008)'),
('SHOW SEASON 10 (2008)'),
('ANOTHER 1ST ANNIVERSARY SEASON 01'),
('ANOTHER 2ND ANNIVERSARY SEASON 01'),
('ANOTHER 10TH ANNIVERSARY SEASON 01'),
('ANOTHER SEASON 01'),
('ANOTHER SEASON 02'),
('SOME SHOW ANOTHER NO SEASON NUMBER'),
('SOME SHOW NO SEASON NUMBER'),
('SHOW 2 THE 75TH ANNIVERSARY SEASON 1'),
('SOME SHOW WITH SEASON NUMBER 1'),
('SOME SHOW WITH SEASON NUMBER 2'),
('SOME SHOW WITH SEASON NUMBER 10')

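To get the required result, I created a function that splits out all the words and numbers in the text. (Note: to be on the safe side, in case a user mistakenly types a space between 1 and st, the function also strips st from 1st, nd from 2nd, and so on, after trimming that space. If you think there is no chance of that mistake, remove this part of the function, because it also strips those letters from ordinary text: a value such as "1 the title" is converted to "1 e title".)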

--CREATE SPLIT FUNCTION
CREATE FUNCTION [dbo].[SplitAlphaNumeric]
(
    @LIST NVARCHAR(2000)
) 
RETURNS @RTNVALUE TABLE
(

    ID INT IDENTITY(1,1),
    WORDS NVARCHAR(100),
    NUMBERS INT
)
AS 
BEGIN
    WHILE (PATINDEX('%[0-9]%',@LIST) > 0)
    BEGIN
        INSERT INTO @RTNVALUE (WORDS, NUMBERS)
        SELECT  CASE    WHEN PATINDEX('%[0-9]%',@LIST) = 0 THEN @LIST
                        WHEN (PATINDEX('%[0-9]%',@LIST) = 1 AND PATINDEX('%[^0-9]%',@LIST) = 0) THEN ''
                        WHEN PATINDEX('%[0-9]%',@LIST) = 1 THEN ''
                        ELSE SUBSTRING(@LIST, 1, PATINDEX('%[0-9]%',@LIST) - 1) 
                END,
                CASE    WHEN PATINDEX('%[0-9]%',@LIST) = 0 THEN NULL
                        WHEN (PATINDEX('%[0-9]%',@LIST) = 1 AND PATINDEX('%[^0-9]%',@LIST) = 0) THEN CAST(LTRIM(RTRIM(@LIST)) AS INT)
                        WHEN PATINDEX('%[0-9]%',@LIST) = 1 THEN SUBSTRING(@LIST, 1, PATINDEX('%[^0-9]%',@LIST) - 1) 
                        ELSE NULL
                END

            SET @LIST = LTRIM(RTRIM(CASE    WHEN PATINDEX('%[0-9]%',@LIST) = 0 OR (PATINDEX('%[0-9]%',@LIST) = 1 AND PATINDEX('%[^0-9]%',@LIST) = 0) THEN ''
                                            WHEN PATINDEX('%[0-9]%',@LIST) = 1 THEN 
                                                    CASE    WHEN LTRIM(SUBSTRING(@LIST, PATINDEX('%[^0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[^0-9]%',REVERSE(@LIST)))) LIKE 'ST%' THEN SUBSTRING(LTRIM(SUBSTRING(@LIST, PATINDEX('%[^0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[^0-9]%',REVERSE(@LIST)))),3, LEN(LTRIM(SUBSTRING(@LIST, PATINDEX('%[^0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[^0-9]%',REVERSE(@LIST))))))
                                                            WHEN LTRIM(SUBSTRING(@LIST, PATINDEX('%[^0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[^0-9]%',REVERSE(@LIST)))) LIKE 'ND%' THEN SUBSTRING(LTRIM(SUBSTRING(@LIST, PATINDEX('%[^0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[^0-9]%',REVERSE(@LIST)))),3, LEN(LTRIM(SUBSTRING(@LIST, PATINDEX('%[^0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[^0-9]%',REVERSE(@LIST))))))
                                                            WHEN LTRIM(SUBSTRING(@LIST, PATINDEX('%[^0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[^0-9]%',REVERSE(@LIST)))) LIKE 'RD%' THEN SUBSTRING(LTRIM(SUBSTRING(@LIST, PATINDEX('%[^0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[^0-9]%',REVERSE(@LIST)))),3, LEN(LTRIM(SUBSTRING(@LIST, PATINDEX('%[^0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[^0-9]%',REVERSE(@LIST))))))
                                                            WHEN LTRIM(SUBSTRING(@LIST, PATINDEX('%[^0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[^0-9]%',REVERSE(@LIST)))) LIKE 'TH%' THEN SUBSTRING(LTRIM(SUBSTRING(@LIST, PATINDEX('%[^0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[^0-9]%',REVERSE(@LIST)))),3, LEN(LTRIM(SUBSTRING(@LIST, PATINDEX('%[^0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[^0-9]%',REVERSE(@LIST))))))
                                                            ELSE LTRIM(SUBSTRING(@LIST, PATINDEX('%[^0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[^0-9]%',REVERSE(@LIST))))
                                                    END
                                            ELSE SUBSTRING(@LIST, PATINDEX('%[0-9]%',@LIST), LEN(@LIST)-PATINDEX('%[0-9]%',REVERSE(@LIST))) 
                                    END))
    END
    INSERT INTO @RTNVALUE (WORDS)
    SELECT VALUE = LTRIM(RTRIM(@LIST))
    RETURN
END

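In the third step, I use CROSS APPLY to call the function, because the function returns a table for a given string value. In the select query, I insert all the columns into a temp table so that I can sort them as required in the next step.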

SELECT  T.TITLE, A.ID, A.NUMBERS, A.WORDS INTO #FINAL
FROM    #TVSEASON T
        CROSS APPLY dbo.SplitAlphaNumeric(TITLE) A

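From the temp table #FINAL, I use STUFF to concatenate all the words back into a title that no longer contains any of the numbers from the text, and then use that value to sort the titles.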

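You can change the ORDER BY clause to whatever order you want, for example the TEXTVAL column first and then the numbers. If you want to order by the sum of all the numbers used in the title, order by the summed number as I do. If you want to order by the plain numbers without summing them, don't use the GROUP BY clause and subquery; order by the number directly. In short, you can achieve any alphanumeric ordering by modifying the query below; the query above is the base for all of them.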

SELECT  A.TITLE--, A.NUMBERS, A.TEXTVAL
FROM    (
            SELECT  A.TITLE, 
                    STUFF((
                        SELECT  ' ' + B.WORDS 
                        FROM    #FINAL B
                        WHERE   B.TITLE = A.TITLE
                        FOR XML PATH(''),TYPE).value('(./text())[1]','VARCHAR(MAX)') -- xml methods and XQuery functions are case-sensitive
                    ,1,1,'') TEXTVAL,
                    SUM(ISNULL(A.NUMBERS,0)) NUMBERS
            FROM    #FINAL A
            GROUP BY A.TITLE
        ) A 
ORDER BY A.TEXTVAL, A.NUMBERS

DROP TABLE #FINAL
DROP TABLE #TVSEASON

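Finally, I drop the two temp tables. I think this is the query you wanted for sorting the values, and anyone with a different ordering requirement for alphanumeric values can meet it by modifying this query.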
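The sort key this answer builds (the title text with the numbers removed, followed by the sum of the numbers) can be sketched in Python to see how it behaves. This is a loose approximation rather than a port of the T-SQL function: `text_and_number_sum` is a name invented here, and the regular expression that drops ordinal suffixes is an assumption about the intended behaviour.

```python
import re

def text_and_number_sum(title):
    """Return (title text without numbers, sum of all numbers in the title)."""
    numbers = [int(n) for n in re.findall(r"[0-9]+", title)]
    # Drop each number together with an ordinal suffix such as 1ST or 2ND,
    # allowing for a space typed by mistake between the number and the suffix.
    words = re.sub(r"[0-9]+\s*(?:ST|ND|RD|TH)?", " ", title, flags=re.IGNORECASE)
    text = " ".join(words.split())
    return text, sum(numbers)

titles = [
    "ANOTHER 10TH ANNIVERSARY SEASON 01",
    "ANOTHER SEASON 01",
    "ANOTHER 1ST ANNIVERSARY SEASON 01",
    "ANOTHER 2ND ANNIVERSARY SEASON 01",
]
ordered = sorted(titles, key=text_and_number_sum)
print(ordered)
```

Two properties of this key are worth checking against your own data. The suffix stripping has the side effect mentioned in the note above ("1 the title" loses its "th"), and summing means that titles whose numbers differ but add up to the same total, such as "SHOW 2 SEASON 1" and "SHOW 1 SEASON 2", receive identical keys.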

Answer 3 (score: 0)

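My answer uses OPENJSON to split each title into words, then replaces each number with the same number of 'a' characters. For example, 2 becomes aa and 10 becomes aaaaaaaaaa. That leaves a set of rows, one row per word. I then use STRING_AGG within each title to join the words back together, producing a new title in which the numbers are replaced by a's. Finally, I sort by that and report the original title: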

with Words1 as 
(
    select title, REPLACE(REPLACE(value, '(', ''), ')', '') word, [key] as RowN
    from tvseason
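   -- Escape backslashes and double quotes first, then turn each space into a "," separator
   -- so that OPENJSON sees one array element per word
   CROSS APPLY OPENJSON('["' +  
      REPLACE(REPLACE(REPLACE(title,'\','\\'),'"','\"'),' ','","') + 
      '"]')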
),
Words2
AS
(
    SELECT title,
           CASE 
                WHEN ISNUMERIC(word) = 1 THEN Replicate('a', CAST(Word as INT))
                WHEN word like '%st' AND ISNUMERIC(LEFT(word, LEN(Word)-2)) = 1
                   THEN Replicate('a', CAST(LEFT(Word, LEN(Word)-2) as INT))
                WHEN word like '%nd' AND ISNUMERIC(LEFT(word, LEN(Word)-2)) = 1
                   THEN Replicate('a', CAST(LEFT(Word, LEN(Word)-2) as INT))
                WHEN word like '%rd' AND ISNUMERIC(LEFT(word, LEN(Word)-2)) = 1
                   THEN Replicate('a', CAST(LEFT(Word, LEN(Word)-2) as INT))
                WHEN word like '%th' AND ISNUMERIC(LEFT(word, LEN(Word)-2)) = 1
                   THEN Replicate('a', CAST(LEFT(Word, LEN(Word)-2) as INT))
                else Word 
                END As Word,
                rowN
    from words1
),
Words3
AS
(
    SELECT title, STRING_AGG(Word, ' ') WITHIN GROUP (Order By rowN ASC) AS TitleLong
    FROM Words2
    GROUP BY Title
)
SELECT title
FROM Words3
ORDER BY TitleLong

This produces the following result:

**title**
100 Season 1
100 Season 2
100 Season 03
100 Season 4
100 Season 10
Another 1st Anniversary Season 01
Another 2nd Anniversary Season 01
Another 10th Anniversary Season 01
Another Season 01
Another Season 02
Show 2 Season 1
Show Season 1 (2008)
Show Season 2 (2008)
Show Season 10 (2008)
Some Show Another No Season Number
Some Show No Season Number
Some Show With Season Number 1
Some Show With Season Number 2
Some Show With Season Number 10
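The padding trick in this answer can be tried out in a few lines of Python. This is a sketch under assumptions: `padded_title` and `ORDINAL` are names made up here, and the key is lowercased to imitate a case-insensitive collation, in which the 'a' padding sorts before other letters.

```python
import re

# A whole word that is a number, optionally followed by an ordinal suffix.
ORDINAL = re.compile(r"([0-9]+)(?:st|nd|rd|th)?", re.IGNORECASE)

def padded_title(title):
    """Replace each numeric word with that many 'a' characters."""
    words = []
    for word in title.split(" "):
        word = word.replace("(", "").replace(")", "")
        match = ORDINAL.fullmatch(word)
        words.append("a" * int(match.group(1)) if match else word)
    return " ".join(words)

titles = ["Show Season 10 (2008)", "Show Season 2 (2008)", "Show Season 1 (2008)"]
ordered = sorted(titles, key=lambda t: padded_title(t).lower())
print(ordered)
```

The key grows with the size of the numbers (2008 becomes 2008 characters), so this suits small values such as season numbers better than large identifiers.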