SQL - STRING_SPLIT string position

Time: 2018-02-12 21:44:45

Tags: sql sql-server

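I have a table with two columns of comma-separated strings. The way the data is formatted, the number of comma-separated items in both columns is equal, and the first value in colA is related to the first value in colB, and so on. (It's obviously not a very good data format, but it's what I'm working with.)

If I have the following row (PrimaryKeyID | column1 | column2):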

1 | a,b,c | A,B,C

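then in this data format, a & A are logically related, b & B, and so on.

I want to use STRING_SPLIT to split these columns, but using it twice obviously crosses them with each other, producing 9 rows in total.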

1 | a | A
1 | b | A    
1 | c | A    
1 | a | B    
1 | b | B    
1 | c | B    
1 | a | C
1 | b | C    
1 | c | C
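
For reference, a minimal sketch of the double split that yields those nine rows. It assumes the table is named myTable with columns ID, column1, and column2, the names used in the query further down:

```sql
-- Each CROSS APPLY multiplies the rows: 3 items x 3 items = 9 rows per ID.
SELECT t.ID,
       s1.value AS Column1Value,
       s2.value AS Column2Value
FROM myTable t
CROSS APPLY STRING_SPLIT(t.column1, ',') s1
CROSS APPLY STRING_SPLIT(t.column2, ',') s2;
```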

What I want is just the 3 "logically related" rows:

1 | a | A
1 | b | B
1 | c | C

However, STRING_SPLIT(myCol, ',') doesn't appear to save the string position anywhere.

I have done the following:

SELECT tbl.ID,
      t1.Column1Value,
      t2.Column2Value
FROM myTable tbl
INNER JOIN (
   SELECT t.ID, 
       ss.value AS Column1Value, 
       ROW_NUMBER() OVER (PARTITION BY t.ID ORDER BY t.ID) as StringOrder
   FROM myTable t
   CROSS APPLY STRING_SPLIT(t.column1,',') ss
) t1 ON tbl.ID = t1.ID
INNER JOIN (
   SELECT t.ID, 
       ss.value AS Column2Value, 
       ROW_NUMBER() OVER (PARTITION BY t.ID ORDER BY t.ID) as StringOrder
   FROM myTable t
   CROSS APPLY STRING_SPLIT(t.column2,',') ss
) t2 ON tbl.ID = t2.ID AND t1.StringOrder = t2.StringOrder

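This seems to work with my small test set, but in my opinion there is no reason to expect it to be guaranteed to work every time. The ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) is obviously a meaningless ordering, but it appears that, absent any real ordering, STRING_SPLIT returns the values in the "default" order they were already in. Is this "expected" behaviour? Can I count on this? Is there any other way of accomplishing what I'm attempting to do?

Thanks.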

======================

EDIT

I got what I wanted (I think) with the following UDF. However, it's pretty slow. Any suggestions?

CREATE FUNCTION fn.f_StringSplit(@string VARCHAR(MAX),@delimiter VARCHAR(1))
RETURNS @r TABLE
(
    Position INT,
    String VARCHAR(255)
)
AS
BEGIN

    DECLARE @current_position INT
    SET @current_position = 1

    WHILE CHARINDEX(@delimiter,@string) > 0 BEGIN

        INSERT INTO @r (Position,String) VALUES (@current_position, SUBSTRING(@string,1,CHARINDEX(@delimiter,@string) - 1))

        SET @current_position = @current_position + 1
        SET @string = SUBSTRING(@string,CHARINDEX(@delimiter,@string) + 1, LEN(@string) - CHARINDEX(@delimiter,@string))

    END

    --add the last one
    INSERT INTO @r (Position, String) VALUES(@current_position,@string)

    RETURN
END

5 Answers:

Answer 0 (score: 2)

Your idea is fine, but your ORDER BY is not using a stable sort. I think it is safer to do:

SELECT tbl.ID, t1.Column1Value, t2.Column2Value
FROM myTable tbl INNER JOIN
     (SELECT t.ID, ss.value AS Column1Value, 
             ROW_NUMBER() OVER (PARTITION BY t.ID
                                ORDER BY CHARINDEX(',' + ss.value + ',', ',' + t.column1 + ',')
                               ) as StringOrder
      FROM myTable t CROSS APPLY
           STRING_SPLIT(t.column1,',') ss
     ) t1
     ON tbl.ID = t1.ID INNER JOIN
     (SELECT t.ID, ss.value AS Column2Value, 
             ROW_NUMBER() OVER (PARTITION BY t.ID
                                ORDER BY CHARINDEX(',' + ss.value + ',', ',' + t.column2 + ',')
                               ) as StringOrder
      FROM myTable t CROSS APPLY
           STRING_SPLIT(t.column2, ',') ss
     ) t2
     ON tbl.ID = t2.ID AND t1.StringOrder = t2.StringOrder;

Note: This may not work properly if the strings have non-adjacent duplicates.
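
A small sketch of that caveat, using a made-up literal 'a,b,a' rather than the real table. CHARINDEX always finds the first occurrence, so both a values get the same sort key and the row numbering between them is arbitrary:

```sql
DECLARE @s VARCHAR(100) = 'a,b,a';  -- hypothetical sample value

SELECT ss.value,
       -- Position of the first ',value,' match in ',a,b,a,':
       -- both 'a' rows get 1, 'b' gets 3.
       CHARINDEX(',' + ss.value + ',', ',' + @s + ',') AS SortKey
FROM STRING_SPLIT(@s, ',') ss;
```

Ordered by SortKey, the sequence becomes a, a, b, so the third item lands in second place. When the duplicates are adjacent the tie is harmless, because the tied values are identical.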

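Answer 1 (score: 1)

I'm a little late to this question, but I was just attempting the same thing with string_split, since I've run into a performance problem of late. My experience with string splitters in T-SQL has led me to use recursive CTEs for most things containing fewer than 1,000 delimited values. Ideally, a CLR procedure would be used if you need the ordinal in your string split.

That said, I've come to a similar conclusion as you on getting the ordinal from string_split. Below you can see the queries and statistics which, in order, are the bare string_split function, a CTE RowNumber over string_split, the Moden splitter, and then my personal string split CTE function, which I derived from this awesome write-up. The main difference between my CTE-based function and the one in the write-up is that I made it an Inline-TVF instead of their MultiStatement-TVF implementation; you can read about the differences here.

In my experiments I haven't seen a deviation when using ROW_NUMBER over a constant to return the internal order of the delimited string, so I will keep using it until I find a problem with it. If order is imperative in a business setting, though, I would probably recommend the Moden splitter featured in the first link above (which links to the author's article here), since its performance is right in line with the less safe string_split with RowNumber approach.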

set nocount on;

declare
    @iter int = 0,
    @rowcount int,
    @val varchar(max) = '';

while len(@val) < 1e6
    select
        @val += replicate(concat(@iter, ','), 8e3),
        @iter += 1;

raiserror('Begin string_split Built-In', 0, 0) with nowait;

set statistics time, io on;

select
    *
from
    string_split(@val, ',')
where
    [value] > '';

select
    @rowcount = @@rowcount;

set statistics time, io off;

print '';
raiserror('End string_split Built-In | Return %d Rows', 0, 0, @rowcount) with nowait;
print '';
raiserror('Begin string_split Built-In with RowNumber', 0, 0) with nowait;

set statistics time, io on;

with cte
as  (
    select
        *,
        [group] = 1
    from
        string_split(@val, ',')
    where
        [value] > ''
    ),
    cteCount
as  (
    select
        *,
        [id] = row_number() over (order by [group])
    from
        cte
    )
select
    *
from
    cteCount;

select
    @rowcount = @@rowcount;

set statistics time, io off;

print '';
raiserror('End string_split Built-In with RowNumber | Return %d Rows', 0, 0, @rowcount) with nowait;
print '';
raiserror('Begin Moden String Splitter', 0, 0) with nowait;

set statistics time, io on;

select
    *
from
    dbo.SplitStrings_Moden(@val, ',')
where
    item > '';

select
    @rowcount = @@rowcount;

set statistics time, io off;

print '';
raiserror('End Moden String Splitter | Return %d Rows', 0, 0, @rowcount) with nowait;
print '';
raiserror('Begin Recursive CTE String Splitter', 0, 0) with nowait;

set statistics time, io on;

select
    *
from
    dbo.fn_splitByDelim(@val, ',')
where
    strValue > ''
option
    (maxrecursion 0);

select
    @rowcount = @@rowcount;

set statistics time, io off;

Statistics

Begin string_split Built-In

 SQL Server Execution Times:
   CPU time = 2000 ms,  elapsed time = 5325 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

End string_split Built-In | Return 331940 Rows

Begin string_split Built-In with RowNumber

 SQL Server Execution Times:
   CPU time = 2094 ms,  elapsed time = 8119 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

End string_split Built-In with RowNumber | Return 331940 Rows

Begin Moden String Splitter
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 6 ms.

 SQL Server Execution Times:
   CPU time = 8734 ms,  elapsed time = 9009 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

End Moden String Splitter | Return 331940 Rows

Begin Recursive CTE String Splitter
Table 'Worktable'. Scan count 2, logical reads 1991648, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 147188 ms,  elapsed time = 147480 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

End Recursive CTE String Splitter | Return 331940 Rows

Answer 2 (score: 1)

SELECT
PrimaryKeyID ,t2.items as column1, t1.items as column2 from [YourTableName]
cross Apply [dbo].[Split](column2) as t1
cross Apply [dbo].[Split](column1) as t2

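Answer 3 (score: 0)

The only way I've found to expressly maintain the order of the String_Split() function is to use the Row_Number() function with a literal value in the "order by".

For example: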

declare @Version nvarchar(128)
set @Version = '1.2.3';

with V as (select value v, Row_Number() over (order by (select 0)) n from String_Split(@Version, '.'))
    select
        (select v from V where n = 1) Major,
        (select v from V where n = 2) Minor,
        (select v from V where n = 3) Revision

Returns:

Major Minor Revision
----- ----- ---------
1     2     3        
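
If a newer engine is available, the position no longer has to be inferred. As far as I know, SQL Server 2022 and Azure SQL Database accept an optional third enable_ordinal argument to STRING_SPLIT, which adds an ordinal column holding the 1-based position of each item. A sketch of the same version split under that assumption (it will not compile on SQL Server 2016 through 2019):

```sql
DECLARE @Version NVARCHAR(128) = '1.2.3';

-- Pivot the three parts by their ordinal instead of by ROW_NUMBER().
SELECT MAX(CASE WHEN ordinal = 1 THEN value END) AS Major,
       MAX(CASE WHEN ordinal = 2 THEN value END) AS Minor,
       MAX(CASE WHEN ordinal = 3 THEN value END) AS Revision
FROM STRING_SPLIT(@Version, '.', 1);  -- third argument 1 = enable_ordinal
```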

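Answer 4 (score: 0)

Mark, here is the solution I would use. Assuming that [column 1] in your table holds "key" values that are more or less stable, and that [column2] holds the corresponding "field" values, which can sometimes be omitted or NULL:

  • There will be two extractions, one for [column 1], which I assume is the key, and another for [column 2], which I assume is a kind of "value" for the "key". Both are parsed automatically by the STRING_SPLIT function.

  • These two INDEPENDENT result sets will then be renumbered based on the time of operation (which is always sequential). Note that we renumber not by the field content or by the position of the commas, but by the timestamp.

  • They will then be joined back together with a LEFT OUTER JOIN; note, not an INNER JOIN, because our "field values" may be omitted, while the "keys" will always be there.

Below is the TSQL code; as this is my first post to this site, I hope it looks OK: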

SELECT T1.ID, T1.KeyValue, T2.FieldValue
from (select t1.ID, row_number() OVER (PARTITION BY t1.ID ORDER BY current_timestamp) AS KeyRow, t2.value AS KeyValue 
from myTable t1
CROSS APPLY STRING_SPLIT(t1.column1,',')  as t2) T1
LEFT OUTER JOIN
(select t1.ID, row_number() OVER (PARTITION BY t1.ID ORDER BY current_timestamp) AS FieldRow, t3.value AS FieldValue 
from myTable t1
CROSS APPLY STRING_SPLIT(t1.column2,',')  as t3) T2 ON T1.ID = T2.ID AND T1.KeyRow = T2.FieldRow
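
Another way to get a position that does not rely on the order in which rows happen to arrive is to turn each list into a JSON array and read it with OPENJSON, whose [key] column is the zero-based index when the input is an array. This is only a sketch and not part of the answer above; it assumes SQL Server 2016 or later with database compatibility level 130 or higher, and reuses the myTable, column1, and column2 names from the question:

```sql
SELECT t.ID,
       k.value AS KeyValue,
       f.value AS FieldValue
FROM myTable t
-- 'a,b,c' becomes '["a","b","c"]'; STRING_ESCAPE guards quotes and backslashes.
CROSS APPLY OPENJSON('["' + REPLACE(STRING_ESCAPE(t.column1, 'json'), ',', '","') + '"]') k
-- OUTER APPLY keeps the key row even when column2 has no item at that index.
OUTER APPLY (
    SELECT j.value
    FROM OPENJSON('["' + REPLACE(STRING_ESCAPE(t.column2, 'json'), ',', '","') + '"]') j
    WHERE j.[key] = k.[key]
) f
ORDER BY t.ID, CAST(k.[key] AS INT);
```

The OUTER APPLY plays the same role as the LEFT OUTER JOIN above: a key with no matching field value still appears, with FieldValue as NULL.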