Reading millions of records from SQL Server, processing them, and inserting them into another SQL Server

Asked: 2014-03-06 19:09:15

Tags: c# sql sql-server tsql

I have a table with the following structure:

CREATE TABLE [dbo].[case_waveform_data] (
    [case_id]                INT             NOT NULL,
    [channel_index]          INT             NOT NULL,
    [seconds_between_points] REAL            NOT NULL,
    [last_time_stamp]        DATETIME        NOT NULL,
    [value_array]            VARBINARY (MAX) NULL
);

This table will contain millions of records. I want to read the data from one database by case_id and channel_index, combine it into 5-minute chunks by decompressing the value_array data and chaining the values together, compress that stream, and then add the combined chunks to another database.

My code works fine for about 100k records. After that I get random errors, such as out-of-memory exceptions in System.Data, CRC mismatches between the compressed and uncompressed data, and invalid characters in the compressed data. Once I go past 100k records the errors occur at random points.

I was using LINQ to loop through the records but have since switched to a SqlDataReader directly. To import the records I use SqlBulkCopy, but I found that I get the errors even with that part commented out. It seems that if I write each combined record out to a file as an INSERT statement the code completes, but if I start collecting the combined records into a list that I pass to SqlBulkCopy for insertion, I get the random errors. Most often it is an out-of-memory error on the reader.Read() line (or on the foreach (var record in records) line when using LINQ). The memory of the process itself sits at around 80 MB for working set, private, and commit.

Any ideas on what I am doing wrong? Is there a better way to accomplish this? If I go with the file I am writing out, it gets to ~300 MB; can I load a file of that size?

Here is the whole function. It has been rewritten about 20 times, so there may be some odd code in it:

using (LiveORDataContext dc = new LiveORDataContext(LiveORDataManager.ConnectionString))
{
    dc.Log = Console.Out;
    dc.ObjectTrackingEnabled = false;

    Stopwatch sw = Stopwatch.StartNew();

    int recordcount = 0;
    // Increase the timeout to 10 minutes for really big cases
    dc.CommandTimeout = 600;
    //Dictionary<int, int> channelindexes = dc.case_waveform_datas.Where(d => d.case_id == livecaseid).GroupBy(d => d.channel_index).ToDictionary(d => d.Key, d => d.Count());

    // get a distinct list of all the channel indexes we need to import for this case
    List<int> channelindexes = (from wd in  dc.case_waveform_datas
                                where wd.case_id == livecaseid
                                group wd by wd.channel_index into grp
                                select grp.Key)
                               .ToList();

    // Loop through each channel's data for the case, combine it and compress it
    foreach (int channel in channelindexes)
    {
        List<case_waveform_data> wavedatalist = new List<case_waveform_data>();
        int warehouserecordcount = 0;
        float secondsbetweenpoints = float.NaN;
        DateTime lastaddedrecordtime = DateTime.MinValue;
        DateTime previoustime = DateTime.MinValue;
        List<float> wfpoints = new List<float>();

        string queryString = String.Format("SELECT case_id, channel_index, last_time_stamp, seconds_between_points, " +
                                           "value_array FROM case_waveform_data " +
                                           "WHERE case_id = {0} and channel_index = {1} " +
                                           "ORDER BY last_time_stamp", 
            livecaseid, channel);

        using (SqlConnection connection = new SqlConnection(LiveORDataManager.ConnectionString))
        {
            SqlCommand command = new SqlCommand(queryString, connection);
            connection.Open();

            SqlDataReader reader = command.ExecuteReader();

            // Call Read before accessing data. 
            while (reader.Read()) // Currently fails here
            {
                var item = new
                {
                   case_id = reader.GetInt32(0),
                   channel_index = reader.GetInt32(1),
                   last_time_stamp = reader.GetDateTime(2),
                   seconds_between_points = reader.GetFloat(3),
                   value_array = (byte[])reader["value_array"]
                };                    

        //var wdlist = from wfd in dc.case_waveform_datas
        //    where wfd.case_id == livecaseid && wfd.channel_index == channel
        //    orderby wfd.last_time_stamp
        //    select new
        //           {
        //               wfd.case_id,
        //               wfd.channel_index,
        //               wfd.last_time_stamp,
        //               wfd.seconds_between_points,
        //               wfd.value_array
        //           };

        // Loop through each channel and create floating point arrays that are larger than 
        // per second groups.    
        //foreach (var item in wdlist)
        //{
            // Get a record count for the info log
            recordcount++;

            if (float.IsNaN(secondsbetweenpoints))
            {
                secondsbetweenpoints = item.seconds_between_points > 0.0f
                    ? item.seconds_between_points
                    : 0.002f;
            } // assume .002 as a default if this is not set

            if (lastaddedrecordtime == DateTime.MinValue)
            {
                lastaddedrecordtime = item.last_time_stamp;
            }
            if (previoustime == DateTime.MinValue)
            {
                previoustime = item.last_time_stamp;
            }

            if ((secondsbetweenpoints != item.seconds_between_points && item.seconds_between_points > 0.0f) ||
                item.last_time_stamp > lastaddedrecordtime.AddMinutes(5))
            {
                // The seconds between points has changed so gzip the array of 
                // floats and insert the record.
                var ms = new MemoryStream();
                using (var gZipStream = new GZipStream(ms, CompressionMode.Compress))
                {
                    new BinaryFormatter().Serialize(gZipStream, wfpoints.ToArray());
                }

                // add the new combined record to a list that will be bulk inserted every 1000 records
                wavedatalist.Add(
                    //dcwarehouse.case_waveform_datas.InsertOnSubmit(
                    new case_waveform_data
                    {
                        case_id = warehousecaseid,
                        channel_index = channel,
                        seconds_between_points = secondsbetweenpoints,
                        last_time_stamp = previoustime,
                        value_array = ms.ToArray()
                    });
                if (writeFile) { writer.WriteLine("(@caseid, {0}, {1}, '{2}', 0x{3}),", channel, secondsbetweenpoints, previoustime, BitConverter.ToString(ms.ToArray()).Replace("-", string.Empty)); }
                ms.Close();
                wfpoints.Clear();
                secondsbetweenpoints = item.seconds_between_points;
                lastaddedrecordtime = item.last_time_stamp;

                // To keep memory down submit the changes to the warehouse database more often
                // than after the whole channel's data has been prepared. This handles cases
                // that have run for multiple days
                warehouserecordcount++;
                if (warehouserecordcount > 300)
                {
                    BulkInsertAll(wavedatalist);
                    wavedatalist.Clear();
                    warehouserecordcount = 0;
                    Console.WriteLine("Recordcount: {0}", recordcount);
                }
            }

            // Decompress the float values and append them
            var ms1 = new MemoryStream(item.value_array);
            using (var gZipStream = new GZipStream(ms1, CompressionMode.Decompress))
            {
                // Decompress the float array
                float[] wd = (float[])new BinaryFormatter().Deserialize(gZipStream);

                // determine the timestamp of the first float given the timestamp of the last float,
                // the number of elements and the seconds between floats
                var listfirsttimestamp =
                    item.last_time_stamp.AddSeconds((wd.Length - 1) * secondsbetweenpoints * -1);

                // if the last time of the previous list + the seconds between is still 
                // less than the new list's first time then add in NaNs
                while (previoustime.AddSeconds(secondsbetweenpoints) < listfirsttimestamp)
                {
                    wfpoints.Add(float.NaN);
                    previoustime = previoustime.AddSeconds(secondsbetweenpoints);
                }

                // now append the list
                wfpoints.AddRange(wd);
            }
            ms1.Close();
            previoustime = item.last_time_stamp;

        //}
            }

            // Call Close when done reading.
            reader.Close();
        }
        // If there are any points left for the channel add them here
        if (wfpoints.Any())
        {
            var ms = new MemoryStream();
            using (var gZipStream = new GZipStream(ms, CompressionMode.Compress))
            {
                new BinaryFormatter().Serialize(gZipStream, wfpoints.ToArray());
            }

            wavedatalist.Add(
                new case_waveform_data
                {
                    case_id = warehousecaseid,
                    channel_index = channel,
                    seconds_between_points = secondsbetweenpoints,
                    last_time_stamp = previoustime,
                    value_array = ms.ToArray()
                });
            if (writeFile) { writer.WriteLine("(@caseid, {0}, {1}, '{2}', 0x{3}),", channel, secondsbetweenpoints, previoustime, BitConverter.ToString(ms.ToArray()).Replace("-", string.Empty)); }
            ms.Close();
        }

        if (wavedatalist.Count > 0)
        {
            BulkInsertAll(wavedatalist);
            wavedatalist.Clear();
        }
        Console.WriteLine("Recordcount: {0}", recordcount);
    }

    sw.Stop();
    logger.Info("Livecase: [{0}], Warehouse Caseid: [{1}], Recordcount: [{2}]. Waveform data import took [{3}ms]",
        livecaseid, warehousecaseid, recordcount, sw.ElapsedMilliseconds);
}

if (writeFile)
{
    writer.Close();
}

EDIT: Here is one of the errors. It happens on this line:

 var item = new
               {
                   case_id = reader.GetInt32(0),
                   channel_index = reader.GetInt32(1),
                   last_time_stamp = reader.GetDateTime(2),
                   seconds_between_points = reader.GetFloat(3),
                   value_array = (byte[])reader["value_array"]
               };

Here is the stack trace:

System.InvalidOperationException - Internal connection fatal error.
at System.Data.SqlClient.TdsParserStateObject.TryProcessHeader()
at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
at System.Data.SqlClient.TdsParserStateObject.TryReadByteArray(Byte[] buff, Int32 offset, Int32 len, Int32& totalRead)
at System.Data.SqlClient.TdsParserStateObject.TryReadPlpBytes(Byte[]& buff, Int32 offst, Int32 len, Int32& totalBytesRead)
at System.Data.SqlClient.TdsParser.TryReadSqlValue(SqlBuffer value, SqlMetaDataPriv md, Int32 length, TdsParserStateObject stateObj)
at System.Data.SqlClient.SqlDataReader.TryReadColumnInternal(Int32 i, Boolean readHeaderOnly)
at System.Data.SqlClient.SqlDataReader.TryReadColumn(Int32 i, Boolean setTimeout, Boolean allowPartiallyReadColumn)
at System.Data.SqlClient.SqlDataReader.GetValueInternal(Int32 i)
at System.Data.SqlClient.SqlDataReader.GetValue(Int32 i)
at System.Data.SqlClient.SqlDataReader.get_Item(String name)
at LiveOR.Data.AccessLayer.LiveORDataManager.ImportWaveformDataLiveToWarehouse(Int32 livecaseid, Int32 warehousecaseid, String backupfilepath) in c:\SRC\LiveOR\LiveOR.Data\LiveORDataManager.cs:line 2416
at VisionSupport.Scheduler.Start() in c:\SRC\LiveOR\VisionSupport\Scheduler.cs:line 90

An OutOfMemoryException also occurs on the same line. Here is its stack trace:

at System.Data.SqlClient.TdsParserStateObject.TryReadPlpBytes(Byte[]& buff, Int32 offst, Int32 len, Int32& totalBytesRead)
at System.Data.SqlClient.TdsParser.TryReadSqlValue(SqlBuffer value, SqlMetaDataPriv md, Int32 length, TdsParserStateObject stateObj)
at System.Data.SqlClient.SqlDataReader.TryReadColumnInternal(Int32 i, Boolean readHeaderOnly)
at System.Data.SqlClient.SqlDataReader.TryReadColumn(Int32 i, Boolean setTimeout, Boolean allowPartiallyReadColumn)
at System.Data.SqlClient.SqlDataReader.GetValueInternal(Int32 i)
at System.Data.SqlClient.SqlDataReader.GetValue(Int32 i)
at System.Data.SqlClient.SqlDataReader.get_Item(String name)
at LiveOR.Data.AccessLayer.LiveORDataManager.ImportWaveformDataLiveToWarehouse(Int32 livecaseid, Int32 warehousecaseid, String backupfilepath) in c:\SRC\LiveOR\LiveOR.Data\LiveORDataManager.cs:line 2419

EDIT 2:

Here is another random one. I get these just by re-running the same code.

Line:

float[] wd = (float[])new BinaryFormatter().Deserialize(gZipStream);

Exception:

SerializationException: Binary stream '75' does not contain a valid BinaryHeader. Possible causes are invalid stream or object version change between serialization and deserialization.

Stack trace:

at System.Runtime.Serialization.Formatters.Binary.__BinaryParser.Run()
at System.Runtime.Serialization.Formatters.Binary.ObjectReader.Deserialize(HeaderHandler handler, __BinaryParser serParser, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage)
at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream, HeaderHandler handler, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage)
at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream)
at LiveOR.Data.AccessLayer.LiveORDataManager.ImportWaveformDataLiveToWarehouse(Int32 livecaseid, Int32 warehousecaseid, String backupfilepath) in c:\SRC\LiveOR\LiveOR.Data\LiveORDataManager.cs:line 2516

3 Answers:

Answer 0 (score: 0)

Try putting var ms = new MemoryStream(); in a using block.

See the documentation on MemoryStream.Close:

Closes the current stream and releases any resources (such as sockets and file handles) associated with the current stream. Instead of calling this method, ensure that the stream is properly disposed. (Inherited from Stream.)

And Stream.Close:

You can declare the Stream object within a using block (or Using block in Visual Basic) to ensure that the stream and all of its resources are disposed, or you can call the Dispose method explicitly.
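
For example, the compression step from the question could be pulled into a small helper so that both streams are disposed deterministically. This is only a minimal sketch of the pattern, not the original method, and CompressFloats is a hypothetical name:

using System.IO;
using System.IO.Compression;
using System.Runtime.Serialization.Formatters.Binary;

// Serialize a float array into a gzip-compressed byte array.
// ToArray() is called only after the GZipStream has been closed,
// so the gzip footer is flushed into the underlying MemoryStream.
static byte[] CompressFloats(float[] points)
{
    using (var ms = new MemoryStream())
    {
        using (var gZipStream = new GZipStream(ms, CompressionMode.Compress))
        {
            new BinaryFormatter().Serialize(gZipStream, points);
        }
        return ms.ToArray();
    }
}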

Answer 1 (score: 0)

I was going to suggest making sure you close your reader, but I see the Close() there now.

There is a lot going on here, and streaming is certainly the first place to look, but ADO.NET 4.5 has some new features that let you read the data columns in each row sequentially without buffering them, and additionally let you read a byte array as a stream without buffering it all in memory.

Might be worth a read.
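
A minimal sketch of what that could look like against the question's table is below; connectionString, livecaseid, channel and ProcessValueArray stand in for the question's own variables and processing, they are not part of any API mentioned here:

using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "SELECT case_id, channel_index, last_time_stamp, seconds_between_points, value_array " +
    "FROM case_waveform_data WHERE case_id = @caseid AND channel_index = @channel " +
    "ORDER BY last_time_stamp", connection))
{
    command.Parameters.AddWithValue("@caseid", livecaseid);
    command.Parameters.AddWithValue("@channel", channel);
    connection.Open();

    // SequentialAccess tells the reader not to buffer each whole row;
    // columns must then be read in ordinal order.
    using (SqlDataReader reader = command.ExecuteReader(CommandBehavior.SequentialAccess))
    {
        while (reader.Read())
        {
            int caseId = reader.GetInt32(0);
            int channelIndex = reader.GetInt32(1);
            DateTime lastTimeStamp = reader.GetDateTime(2);
            float secondsBetween = reader.GetFloat(3);

            // GetStream (added in .NET 4.5) streams the varbinary(max) column
            // instead of materializing it as one large byte[].
            using (Stream blob = reader.GetStream(4))
            {
                ProcessValueArray(caseId, channelIndex, lastTimeStamp, secondsBetween, blob);
            }
        }
    }
}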

Answer 2 (score: 0)

A simpler way is just SqlBulkCopy plus a bit of Entity Framework reflection:

Start by filtering the data with Take and Skip, processing batches of 2000/3000/5000 records;

then use reflection to build a DataTable and pass it to SqlBulkCopy inside a transaction, using the transaction to guard against problems.

Log each transaction so that, in case of failure, you know which records have not yet been imported.

Keep doing this until the task is complete; it takes very little time. (A rough sketch of this loop is shown right below.)
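
A rough C# sketch of that batching loop, purely to illustrate the Skip/Take idea; query (an IQueryable over the source entities, with System.Linq in scope) and BulkInsert are placeholders for the pieces described above and for the SqlBulkCopy call shown further down:

// Hypothetical batching loop: pull one page of source rows at a time and
// hand each page to a bulk-insert helper that runs inside a transaction.
const int batchSize = 2500;
int skip = 0;
while (true)
{
    var batch = query.OrderBy(r => r.last_time_stamp)
                     .Skip(skip)
                     .Take(batchSize)
                     .ToList();
    if (batch.Count == 0)
        break;

    BulkInsert(batch);   // wraps SqlBulkCopy in a transaction and logs the result
    skip += batchSize;
}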

Here is an example that retrieves a DataTable from entities. Note that the list of objects to pass to the function below is an IEnumerable, so when you filter your data and use .ToList, don't forget to call .AsEnumerable with a statement like this:

  Lstads.Select(Function(x) x).AsEnumerable

That way you can pass the result of your earlier query to this function:

   Public Function EQToDataTable(ByVal parIList As System.Collections.IEnumerable) As System.Data.DataTable
    Dim ret As New System.Data.DataTable()
    Try
        Dim ppi As System.Reflection.PropertyInfo() = Nothing
        If parIList Is Nothing Then Return ret
        For Each itm In parIList
            If ppi Is Nothing Then
                ppi = DirectCast(itm.[GetType](), System.Type).GetProperties()
                For Each pi As System.Reflection.PropertyInfo In ppi
                    Dim colType As System.Type = pi.PropertyType

                    If (colType.IsGenericType) AndAlso
                       (colType.GetGenericTypeDefinition() Is GetType(System.Nullable(Of ))) Then colType = colType.GetGenericArguments()(0)

                    ret.Columns.Add(New System.Data.DataColumn(pi.Name, colType))
                Next
            End If
            Dim dr As System.Data.DataRow = ret.NewRow
            For Each pi As System.Reflection.PropertyInfo In ppi
                dr(pi.Name) = If(pi.GetValue(itm, Nothing) Is Nothing, DBNull.Value, pi.GetValue(itm, Nothing))
            Next
            ret.Rows.Add(dr)
        Next
        For Each c As System.Data.DataColumn In ret.Columns
            c.ColumnName = c.ColumnName.Replace("_", " ")
        Next
    Catch ex As Exception
        ret = New System.Data.DataTable()
        Dim lg As New EADSCORE.Helpers.CustomLogger(False)
        lg.WriteLog(ex)
    End Try
    Return ret
End Function

And here is an example that uses SqlBulkCopy with a transaction:

   Public Sub BulkInserTest(ByVal list As System.Collections.IEnumerable)
    Dim hasElement = False
    For Each el In list
        hasElement = True
        Exit For
    Next
    If hasElement = True Then
        Dim dt As DataTable = EQToDataTable(list)

        Using cnn As New SqlClient.SqlConnection(ConfigurationManager.ConnectionStrings("BUCLCNN").ConnectionString)
            cnn.Open()
            Using tr As SqlClient.SqlTransaction = cnn.BeginTransaction
                Using sqlbulk As New SqlClient.SqlBulkCopy(cnn, SqlBulkCopyOptions.KeepIdentity, tr)
                    With sqlbulk
                        .DestinationTableName = "Ads"
                        .BatchSize = 2500
                        For Each el As DataColumn In dt.Columns
                            If el.ColumnName = "IDAds" Or el.ColumnName = "Province" Or el.ColumnName = "SubCategory" Or el.ColumnName = "AdsComments" Or el.ColumnName = "CarDetails" Or el.ColumnName = "HomeDetails" Or el.ColumnName = "Images" Or el.ColumnName = "Customer" Then
                                ' not executed: skip these columns
                            Else
                                Dim map As New SqlBulkCopyColumnMapping(el.ColumnName, el.ColumnName)
                                .ColumnMappings.Add(map)
                            End If
                        Next
                        Try
                            If dt.Rows.Count > 0 Then
                                .WriteToServer(dt)
                                tr.Commit()
                            End If
                        Catch ex As Exception
                            tr.Rollback()
                            Dim lg As New EADSCORE.Helpers.CustomLogger(False)
                            lg.WriteLog(ex)
                        End Try
                    End With
                End Using
            End Using
            Dim cmd As New SqlCommand("Update Ads Set Article=replace(Article,'&amp;','&');Update Ads Set Title=replace(Article,'&amp;','&')", cnn)
            cmd.ExecuteNonQuery()
        End Using
    End If

End Sub

The code above needs to be adapted, since it contains some extra filters and ifs for my own needs, but it works as well :)

Enjoy.

NOTE: I don't know which entity types you have, so you must check the mappings to make sure everything works fine :)

If it solves your problem, please mark it as the answer.