Converting unstructured data into a table or view

Date: 2016-02-24 22:16:50

Tags: tsql structured-data

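I am currently working with a SQL Server database built by a vendor. The database holds data for calls that come in through the vendor's system. The main table that stores the data has 7 fields: one primary key, two foreign keys, a pair of datetime stamps, and finally one large field called "SegmentLog".

The data in this field is unstructured. Here is a sample: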

/20160219T154710.554-07/0?S=50&E=3512&CUTC=20160219T155235.662-07&1=100187177120160219&2=0&3=18823&4=user%20queue:icadmin&5=&6=Interact&7=|/20160219T154729.377-07/0?S=50&E=3504&CUTC=20160219T155235.663-07&1=100187177120160219&2=0&3=81592&4=user%20queue:icadmin&5=&6=LocalTransfer&7=%3cDetails%20TransferringUser%3d%22ICadmin%20-%22%20TransferringInteractionId%3d%22100187177120160219%22%20TransferredInteractionId%3d%22100187177120160219%22%20/%3e%0a&8=&9=2|/20160219T154850.970-07/0?S=50&E=3502&CUTC=20160219T155235.663-07&1=100187177120160219&2=0&3=55&4=&5=workgroup%20queue:Central%20Ops%202&6=LocalTransfer&7=%3cDetails%20TransferringUser%3d%22ICadmin%20-%22%20TransferringInteractionId%3d%22100187177120160219%22%20TransferredInteractionId%3d%22100187177120160219%22%20TransferredUser%3d%22Phoenix%20AZ%22%20/%3e%0a|/20160219T154851.025-07/0?S=50&E=3500&CUTC=20160219T155235.664-07&1=100187177120160219&2=0&3=1048&4=&5=&6=Queue&7=%3cDetails%20IVRAppName%3d%22Central%20Ops%202%22%20/%3e%0a|/20160219T154852.073-07/0?S=50&E=3502&CUTC=20160219T155235.664-07&1=100187177120160219&2=0&3=13344&4=&5=workgroup%20queue:Central%20Ops%202&6=Interact&7=|/20160219T154905.417-07/0?S=50&E=3504&CUTC=20160219T155235.664-07&1=100187177120160219&2=0&3=26202&4=user%20queue:icadmin&5=workgroup%20queue:Central%20Ops%202&6=LocalDisconnect&7=&8=&9=5

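What I have been told is that each "SegmentLog" can contain multiple "events", identified by "E=" within the SegmentLog field. Events are separated by the pipe symbol "|". Each event begins with a datetime stamp from the server, followed by a SourceID (labeled "S=") and then the EventID (labeled "E=").

After each EventID (a number between 3500 and 3512) come attributes numbered 1 through 9 (labeled "1=", "2=", and so on).

Keep in mind that a single SegmentLog may contain several events with the same EventID, and that not every attribute appears for every EventID (e.g. E=3502 may show only attributes 1-6, while E=3503 may show attributes 1-9). What is the best way to build this data into a table structure? The tools available to me are complex queries inside a view, or SSIS, of which I have intermediate knowledge.

Edit

I would like to see the data like this, but with all attributes included: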

DateTime                    Sequence  EventID  Attr1                  Attr3  
--------                    --------  -------  -----                  -----
/20160219T154710.554-07/0?  S=50      &E=3512  &1=100187177120160219  &3=18823
/20160219T154729.377-07/0?  S=50      &E=3504  &1=100187177120160219  &3=81592
/20160219T154850.970-07/0?  S=50      &E=3502  &1=100187177120160219  &3=55
/20160219T154851.025-07/0?  S=50      &E=3500  &1=100187177120160219  &3=1048

1 Answer:

Answer 0: (score: 0)

OK, I think this is what you are trying to accomplish.

To test this, I loaded your sample row into a SQL Server table with an nvarchar(max) column:

if exists (select * from sysobjects where name='BigLongString' and xtype='U')
drop table dbo.BigLongString;
go

create table dbo.BigLongString
( 
 SegmentLog nvarchar(max)
);
go

insert into dbo.BigLongString (SegmentLog)
values ('/20160219T154710.554-07/0?S=50&E=3512&CUTC=20160219T155235.662-07&1=100187177120160219&2=0&3=18823&4=user%20queue:icadmin&5=&6=Interact&7=|/20160219T154729.377-07/0?S=50&E=3504&CUTC=20160219T155235.663-07&1=100187177120160219&2=0&3=81592&4=user%20queue:icadmin&5=&6=LocalTransfer&7=%3cDetails%20TransferringUser%3d%22ICadmin%20-%22%20TransferringInteractionId%3d%22100187177120160219%22%20TransferredInteractionId%3d%22100187177120160219%22%20/%3e%0a&8=&9=2|/20160219T154850.970-07/0?S=50&E=3502&CUTC=20160219T155235.663-07&1=100187177120160219&2=0&3=55&4=&5=workgroup%20queue:Central%20Ops%202&6=LocalTransfer&7=%3cDetails%20TransferringUser%3d%22ICadmin%20-%22%20TransferringInteractionId%3d%22100187177120160219%22%20TransferredInteractionId%3d%22100187177120160219%22%20TransferredUser%3d%22Phoenix%20AZ%22%20/%3e%0a|/20160219T154851.025-07/0?S=50&E=3500&CUTC=20160219T155235.664-07&1=100187177120160219&2=0&3=1048&4=&5=&6=Queue&7=%3cDetails%20IVRAppName%3d%22Central%20Ops%202%22%20/%3e%0a|/20160219T154852.073-07/0?S=50&E=3502&CUTC=20160219T155235.664-07&1=100187177120160219&2=0&3=13344&4=&5=workgroup%20queue:Central%20Ops%202&6=Interact&7=|/20160219T154905.417-07/0?S=50&E=3504&CUTC=20160219T155235.664-07&1=100187177120160219&2=0&3=26202&4=user%20queue:icadmin&5=workgroup%20queue:Central%20Ops%202&6=LocalDisconnect&7=&8=&9=5')
go

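I then created an SSIS package to extract and parse this data. The Data Flow Task looks like this:

[Image: Data Flow Task]

The SQL statement in the OLE DB Source component is: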

select 
      SegmentLog 
from 
      dbo.BigLongString;

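The Script Component is a transformation with an asynchronous output:

[Image: Inputs and Outputs Form]

If you expand the "Output 0" tree, you can see all the columns I added. The Attr* columns are all DT_WSTR 500. I wasn't sure how large the values can get, so you may want to change the data type. I made the remaining columns DT_WSTR 50:

[Image: Output Columns]

Here is the code for the Script Component. Make sure to build it before you exit: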

 #region Namespaces
 using System;
 using System.Data;
 using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
 using Microsoft.SqlServer.Dts.Runtime.Wrapper;
 using Microsoft.SqlServer.Dts.Pipeline;
 #endregion

 [Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
 public class ScriptMain : UserComponent
 {
   private PipelineBuffer inputBuffer;

 public override void Input0_ProcessInputRow(Input0Buffer Row)
 {

    // Delimiter that separates one event from the next
    string[] dateSplit = new string[] { "|" };

    // Length of the blob. The column index is hardcoded to 0 because the
    // input has only one column in this example.
    int blobLen = (int)inputBuffer.GetBlobLength(0);

    // Read the whole blob (column index hardcoded to 0 again) and decode it
    string webStr = ConvertBlobToString((byte[])inputBuffer.GetBlobData(0, 0, blobLen));

    //holds value for dates in string
    string[] dates = webStr.Split(dateSplit, StringSplitOptions.None);

    //Loop through each date
    foreach (string date in dates)
    {
        //Parse out each attribute for a given date
        string[] attributes = date.Split('&');

        Output0Buffer.AddRow();

        //Loop through each attribute in date, you can remove the "&"+ if you do not need these in the values
        for (int i = 0; i < attributes.Length; i++)
        {

            switch (i)
            {
                case 0:
                    Output0Buffer.DateTime = attributes[i].Substring(0, attributes[i].IndexOf('S'));
                    Output0Buffer.Sequence = attributes[i].Substring(attributes[i].IndexOf('S'), attributes[i].Length - attributes[i].IndexOf('S'));
                    break;
                case 1:
                    Output0Buffer.EventID = "&" + attributes[i];
                    break;
                case 2:
                    Output0Buffer.CUTC = "&" + attributes[i];
                    break;
                case 3:
                    Output0Buffer.Attr1 = "&" + attributes[i];
                    break;
                case 4:
                    Output0Buffer.Attr2 = "&" + attributes[i];
                    break;
                case 5:
                    Output0Buffer.Attr3 = "&" + attributes[i];
                    break;
                case 6:
                    Output0Buffer.Attr4 = "&" + attributes[i];
                    break;
                case 7:
                    Output0Buffer.Attr5 = "&" + attributes[i];
                    break;
                case 8:
                    Output0Buffer.Attr6 = "&" + attributes[i];
                    break;
                case 9:
                    Output0Buffer.Attr7 = "&" + attributes[i];
                    break;
                case 10:
                    Output0Buffer.Attr8 = "&" + attributes[i];
                    break;
                case 11:
                    Output0Buffer.Attr9 = "&" + attributes[i];
                    break;
            }
        }

    }
}

public override void ProcessInput(int InputID, Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer Buffer)
{
    inputBuffer = Buffer;
    base.ProcessInput(InputID, Buffer);
}

public string ConvertBlobToString(byte[] webBlob)
{
    //string to return
    string webStr = null;

    //get string from blob
    webStr = System.Text.Encoding.Unicode.GetString(webBlob);

    return webStr;

}

}
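The script above assigns attributes by position, which lines up for the sample row because any missing attributes are at the end of an event. Since the question says that not every attribute appears for every EventID, a variation that reads the name in front of each "=" may be more robust. The following is only a minimal sketch of that idea, not part of the SSIS-generated code: `SegmentLogParser`, `ParseEvents`, and the `"DateTime"` key are hypothetical names, and it assumes that "|" and "&" never occur unencoded inside a value.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not generated by SSIS): parses by key, not by position.
public static class SegmentLogParser
{
    // Returns one dictionary per event, keyed by the name before '='
    // ("S", "E", "CUTC", "1" ... "9"). The server timestamp is stored
    // under the made-up key "DateTime".
    public static List<Dictionary<string, string>> ParseEvents(string segmentLog)
    {
        var events = new List<Dictionary<string, string>>();
        string[] rawEvents = segmentLog.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string rawEvent in rawEvents)
        {
            // Everything before '?' is the prefix "/<timestamp>/0"
            int queryStart = rawEvent.IndexOf('?');
            if (queryStart < 0)
            {
                continue; // not in the expected shape, skip it
            }

            var fields = new Dictionary<string, string>();
            string[] prefixParts = rawEvent.Substring(0, queryStart).Split('/');
            fields["DateTime"] = prefixParts.Length > 1 ? prefixParts[1] : string.Empty;

            // Each remaining piece looks like "name=value"; the value may be empty
            foreach (string pair in rawEvent.Substring(queryStart + 1).Split('&'))
            {
                int eq = pair.IndexOf('=');
                if (eq <= 0)
                {
                    continue; // no name in front of '=', ignore
                }
                fields[pair.Substring(0, eq)] = pair.Substring(eq + 1);
            }

            events.Add(fields);
        }

        return events;
    }
}
```

Inside `Input0_ProcessInputRow`, each dictionary could then feed one `Output0Buffer.AddRow()` call, using `TryGetValue("3", ...)` to fill `Attr3` and leaving the column empty when the key is absent.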

Run the package and you should see the data parsed out as expected in the Data Viewer:

[Image: Data Viewer]
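The parsed values are still percent-encoded (for example `%20` for a space and `%3c` for `<`), and the timestamps are in a compact form such as `20160219T154710.554-07`. If decoded values are wanted in the output columns, something along these lines might help. This is a sketch under assumptions: `SegmentLogValues`, `DecodeValue`, and `ParseStamp` are hypothetical names, and the format string assumes the offset is always a signed two-digit hour count, as in the sample row.

```csharp
using System;
using System.Globalization;

// Hypothetical helpers for cleaning up values after the split.
public static class SegmentLogValues
{
    // Turns percent-encoded text such as "user%20queue:icadmin"
    // back into plain text.
    public static string DecodeValue(string raw)
    {
        return Uri.UnescapeDataString(raw);
    }

    // Parses a stamp such as "20160219T154710.554-07".
    // Assumption: the UTC offset is always whole hours (e.g. "-07").
    public static DateTimeOffset ParseStamp(string stamp)
    {
        return DateTimeOffset.ParseExact(
            stamp,
            "yyyyMMdd'T'HHmmss.fffzz",
            CultureInfo.InvariantCulture);
    }
}
```

For example, `SegmentLogValues.DecodeValue("user%20queue:icadmin")` returns `user queue:icadmin`. If the decoded values are stored, the `DT_WSTR` column lengths chosen earlier may need to be revisited.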