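SQL: Split multiple multi-value columns into rows

Time: 2015-07-14 14:03:34

Tags: sql parsing split multivalue

I have been sent data that I need to normalize. The data is in a SQL table, but each row has multiple multi-value columns. An example follows: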

ID  fname   lname       projects           projdates
1   John    Doe         projA;projB;projC  20150701;20150801;20150901
2   Jane    Smith       projD;;projC       20150701;;20150902
3   Lisa    Anderson    projB;projC        20150801;20150903
4   Nancy   Johnson     projB;projC;projE  20150601;20150822;20150904
5   Chris   Edwards     projA              20150905

It needs to look like this:

ID  fname   lname      projects projdates
1   John    Doe          projA  20150701
1   John    Doe          projB  20150801
1   John    Doe          projC  20150901
2   Jane    Smith        projD  20150701
2   Jane    Smith        projC  20150902
3   Lisa    Anderson     projB  20150801
3   Lisa    Anderson     projC  20150903
4   Nancy   Johnson      projB  20150601
4   Nancy   Johnson      projC  20150822
4   Nancy   Johnson      projE  20150904
5   Chris   Edwards      projA  20150905

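I need to split this into rows of id, fname and lname, with the projects and projdates parsed out into separate records. I have found many posts with split functions, and I can get them to work for one column, but not for two. When I split two columns, the results multiply out: for John Doe, for example, I get three records for projA, one for each of the projdates. I need to match each multi-value project entry with its corresponding projdate and not with the others.

Any ideas?

Thanks!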

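3 Answers:

Answer 0: (score: 1)

If you use Jeff Moden's "DelimitedSplit8K" splitter (which I have renamed here to "fDelimitedSplit8K"; see Figure 21: The Final "New" Splitter Code, Ready for Testing) to do the heavy lifting of the split, the rest becomes fairly simple: use CROSS APPLY and a WHERE clause to join things up correctly.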

IF object_ID (N'tempdb..#tInputData') is not null 
   DROP TABLE #tInputData

CREATE TABLE #tInputData (
     ID        INT 
        PRIMARY KEY CLUSTERED  -- Add IDENTITY if ID needs to be set at INSERT time
   , FName     VARCHAR (30)
   , LName     VARCHAR (30)
   , Projects  VARCHAR (4000)
   , ProjDates VARCHAR (4000)
)

INSERT INTO #tInputData
         ( ID, FName, LName, Projects, ProjDates )
VALUES
   ( 1, 'John',  'Doe'      , 'projA;projB;projC' , '20150701;20150801;20150901'),
   ( 2, 'Jane',  'Smith'    , 'projD;;projC'      , '20150701;;20150902'),
   ( 3, 'Lisa',  'Anderson' , 'projB;projC'       , '20150801;20150903'),
   ( 4, 'Nancy', 'Johnson'  , 'projB;projC;projE' , '20150601;20150822;20150904'),
   ( 5, 'Chris', 'Edwards'  , 'projA'             , '20150905')

SELECT * FROM #tInputData  -- Take a look at the INSERT results

; WITH ResultSet  AS 
(
   SELECT 
        InData.ID
      , InData.FName
      , InData.LName
      , ProjectList.ItemNumber AS ProjectID
      , ProjectList.Item AS Project
      , DateList.ItemNumber AS DateID
      , DateList.Item AS ProjDate
   FROM #tInputData AS InData
   CROSS APPLY dbo.fDelimitedSplit8K(InData.Projects,';') AS ProjectList
   CROSS APPLY dbo.fDelimitedSplit8K(InData.ProjDates,';') AS DateList
   WHERE DateList.ItemNumber = ProjectList.ItemNumber  -- Links projects and dates in left-to-right order
   AND (ProjectList.Item <> '' AND DateList.Item <> '') -- Skip positions where the project or the date is empty; note that these are empty strings, not NULLs.
)
SELECT 
      ID
    , FName
    , LName
    , Project
    , ProjDate 
FROM ResultSet
ORDER BY ID, Project

Results

ID  FName  LName     Project  ProjDate  
--  -----  --------  -------  --------  
 1  John   Doe       projA    20150701  
 1  John   Doe       projB    20150801  
 1  John   Doe       projC    20150901  
 2  Jane   Smith     projC    20150902  
 2  Jane   Smith     projD    20150701  
 3  Lisa   Anderson  projB    20150801  
 3  Lisa   Anderson  projC    20150903  
 4  Nancy  Johnson   projB    20150601  
 4  Nancy  Johnson   projC    20150822  
 4  Nancy  Johnson   projE    20150904  
 5  Chris  Edwards   projA    20150905  

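This algorithm handles Project and Date lists of equal length. If, for a given row, one list is shorter than the other, special care is needed to apply NULLs in the appropriate places.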
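One possible way to handle lists of unequal length is to pair the two splits with a FULL OUTER JOIN on ItemNumber, so that a position missing from either list comes back as NULL. This is only a sketch: it assumes dbo.fDelimitedSplit8K is installed as described above and returns ItemNumber and Item columns, and the @Uneven table variable and its rows are hypothetical sample data, not part of the question.

```sql
-- Hypothetical sample data with lists of different lengths.
DECLARE @Uneven TABLE (
     ID        INT PRIMARY KEY
   , Projects  VARCHAR(4000)
   , ProjDates VARCHAR(4000)
);

INSERT INTO @Uneven (ID, Projects, ProjDates)
VALUES
   (1, 'projA;projB;projC', '20150701;20150801'),  -- third date is missing
   (2, 'projD',             '20150701;20150902');  -- second project is missing

SELECT
     InData.ID
   , NULLIF(Paired.Project, '')  AS Project   -- empty strings become NULL as well
   , NULLIF(Paired.ProjDate, '') AS ProjDate
FROM @Uneven AS InData
CROSS APPLY (
   SELECT
        COALESCE(ProjectList.ItemNumber, DateList.ItemNumber) AS ItemNumber
      , ProjectList.Item AS Project
      , DateList.Item    AS ProjDate
   -- Assumes the renamed splitter from this answer exists in dbo.
   FROM dbo.fDelimitedSplit8K(InData.Projects, ';') AS ProjectList
   FULL OUTER JOIN dbo.fDelimitedSplit8K(InData.ProjDates, ';') AS DateList
      ON DateList.ItemNumber = ProjectList.ItemNumber
) AS Paired
ORDER BY InData.ID, Paired.ItemNumber;
```

With this shape, projC for ID 1 should appear with a NULL ProjDate, and the second date for ID 2 with a NULL Project. Whether such half-empty rows should be kept or filtered out depends on the data rules.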

-- Cleanup
DROP TABLE #tInputData

Answer 1: (score: 0)

You didn't say what your expected results are, but this might be a good starting point:

declare @t table (ID int not null,fname varchar(17) not null,lname varchar(15) not null,
projects varchar(76) not null,projdates varchar(310) not null)
insert into @t(ID,fname,lname,projects,projdates) values
(1,'John', 'Doe',     'projA;projB;projC','20150701;20150801;20150901'),
(2,'Jane', 'Smith',   'projD;;projC',     '20150701;;20150902'        ),
(3,'Lisa', 'Anderson','projB;projC',      '20150801;20150903'         ),
(4,'Nancy','Johnson', 'projB;projC;projE','20150601;20150822;20150904'),
(5,'Chris','Edwards', 'projA',            '20150905'                  )

;With Numbers as (
    select ROW_NUMBER() OVER (ORDER BY Number) n
    from master..spt_values
), ProjectPositions as (
    select ID,n.n
    from @t t
        inner join
        Numbers n
            on SUBSTRING(t.projects,n.n,1) = ';'
    union all
    select ID,0 from @t
    union all
    select ID,LEN(projects)+1 from @t
), ProjectsNumbered as (
    select *,ROW_NUMBER() OVER (PARTITION BY ID ORDER BY n) rn
    from ProjectPositions
), ProjectPartitions as (
    select n1.ID,n1.n+1 as startat,n2.n as endat,n1.rn
    from ProjectsNumbered n1
            inner join
        ProjectsNumbered n2
            on
                n1.id = n2.id and
                n1.rn = n2.rn -1
), ProDatePositions as (
    select ID,n.n
    from @t t
        inner join
        Numbers n
            on SUBSTRING(t.projdates,n.n,1) = ';'
    union all
    select ID,0 from @t
    union all
    select ID,LEN(projdates)+1 from @t
), ProDateNumbered as (
    select *,ROW_NUMBER() OVER (PARTITION BY ID ORDER BY n) rn
    from ProDatePositions
), ProDatePartitions as (
    select n1.ID,n1.n+1 as startat,n2.n as endat,n1.rn
    from ProDateNumbered n1
            inner join
        ProDateNumbered n2
            on
                n1.id = n2.id and
                n1.rn = n2.rn -1
)
select
    t.ID,t.fname,t.lname,
    SUBSTRING(projects,pp.startat,pp.endat - pp.startat) as project,
    SUBSTRING(projdates,pdp.startat,pdp.endat - pdp.startat) as projdate
from
    @t t
        inner join
    ProjectPartitions pp
        on
            t.ID = pp.ID
        inner join
    ProDatePartitions pdp
        on
            t.ID = pdp.ID and
            pp.rn = pdp.rn

Results:

ID          fname             lname           project     projdate
----------- ----------------- --------------- ----------- ----------
1           John              Doe             projA       20150701
1           John              Doe             projB       20150801
1           John              Doe             projC       20150901
2           Jane              Smith           projD       20150701
2           Jane              Smith                       
2           Jane              Smith           projC       20150902
3           Lisa              Anderson        projB       20150801
3           Lisa              Anderson        projC       20150903
4           Nancy             Johnson         projB       20150601
4           Nancy             Johnson         projC       20150822
4           Nancy             Johnson         projE       20150904
5           Chris             Edwards         projA       20150905

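(It is not clear what you want to do with the "empty" project for ID 2.)

How it works - we build a makeshift Numbers table using ROW_NUMBER() - we query an undocumented table in master, but we don't use any of the actual values in it - we only rely on the fact that it has plenty of rows. If you have a real numbers table, you can skip that CTE.

Then we do the same thing twice - we join the numbers table to our data table and use it to find the positions of the ; characters in the strings we want to split. We also create a pair of dummy results for position 0 (just before the start of the string) and for the position one past the end of the string. This defines ProjectPositions and ProDatePositions.

We number these positions with another ROW_NUMBER() (ProjectsNumbered and ProDateNumbered), then use that information to join consecutive rows together (ProjectPartitions and ProDatePartitions). The end result is that we have computed the start and end positions from which to extract substrings from both strings.

Finally, we join these "partition" CTEs back to the original data table, using the row numbers to make sure we align the partition information from the two separate strings.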
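If querying an undocumented system table is a concern, one common alternative is to generate the numbers from a small VALUES list cross-joined with itself. This is a sketch of that swapped-in technique, not part of the query above; the CTE name Ten is made up for the illustration.

```sql
WITH Ten AS (
    -- Ten rows; the values themselves are never used, only the row count.
    SELECT v FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS t(v)
), Numbers AS (
    -- 10 * 10 * 10 * 10 = 10,000 sequential integers starting at 1.
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM Ten AS a
    CROSS JOIN Ten AS b
    CROSS JOIN Ten AS c
    CROSS JOIN Ten AS d
)
SELECT MIN(n) AS first_n, MAX(n) AS last_n, COUNT(*) AS row_count
FROM Numbers;  -- 1, 10000, 10000
```

The only requirement is that Numbers reaches at least the length of the longest string being split.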
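To make the position step concrete, here is a minimal sketch that applies it to a single string instead of a table column. The variable @s and the small Ten-based numbers CTE are assumptions made for the illustration; the sample value is the projects list of ID 2.

```sql
-- Hypothetical single-string version of the ProjectPositions step.
DECLARE @s VARCHAR(100) = 'projD;;projC';

WITH Ten AS (
    SELECT v FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS t(v)
), Numbers AS (
    -- 100 sequential integers, enough for this short string.
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM Ten AS a CROSS JOIN Ten AS b
), Positions AS (
    SELECT n FROM Numbers WHERE SUBSTRING(@s, n, 1) = ';'  -- each delimiter
    UNION ALL
    SELECT 0                                               -- just before the string
    UNION ALL
    SELECT LEN(@s) + 1                                     -- one past the end
)
SELECT n FROM Positions ORDER BY n;  -- 0, 6, 7, 13
```

The two adjacent delimiters at positions 6 and 7 are what later produce the empty item for ID 2.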
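The numbering and pairing step can be sketched the same way for one string. The boundary positions are hard-coded here for the hypothetical sample value 'projD;;projC' (0, the two delimiters at 6 and 7, and LEN + 1 = 13); the CTE names are shortened for the illustration.

```sql
DECLARE @s VARCHAR(100) = 'projD;;projC';

WITH Positions AS (
    -- Boundary positions for @s, hard-coded for this illustration.
    SELECT n FROM (VALUES (0), (6), (7), (13)) AS p(n)
), Numbered AS (
    SELECT n, ROW_NUMBER() OVER (ORDER BY n) AS rn
    FROM Positions
), Partitions AS (
    -- Pair each boundary with the next one: an item starts one character
    -- after a boundary and ends just before the following boundary.
    SELECT n1.rn, n1.n + 1 AS startat, n2.n AS endat
    FROM Numbered AS n1
    INNER JOIN Numbered AS n2
        ON n1.rn = n2.rn - 1
)
SELECT rn, startat, endat,
       SUBSTRING(@s, startat, endat - startat) AS item
FROM Partitions
ORDER BY rn;
-- rn 1: startat 1, endat 6,  item 'projD'
-- rn 2: startat 7, endat 7,  item ''
-- rn 3: startat 8, endat 13, item 'projC'
```

Because rn is the item's ordinal within the string, joining the project and date partitions on ID and rn is what keeps each project aligned with its own date.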

Answer 2: (score: 0)

Try the following query.

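SELECT A.ID, A.fname, A.lname, A.projects,
       LTRIM(Split.a.value('.', 'VARCHAR(100)')) AS projdates
FROM (SELECT ID, fname, lname, projects,
             CAST('<M>' + REPLACE([projdates], ';', '</M><M>') + '</M>' AS XML) AS String
      FROM YourTable) AS A  -- the table name is missing in the original; YourTable is a placeholder
CROSS APPLY String.nodes('/M') AS Split(a);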

Try this and you should get the expected output.

Thanks.