Data cleaning on a 1-million-row dataset - .NET

Time: 2015-07-30 13:12:27

Tags: .net database excel datatable

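I am building a simple system to clean some raw transaction data. It uses a self-built dictionary to scan certain columns for keywords and categorize the rows.

The problem is that the program runs slowly. On a one-million-row dataset it takes about 60 minutes to finish.

Is there a way to make it run faster? Here is the skeleton of my program (written in .NET):

***Read the source file (Excel .xlsx) over an OleDB connection and fill it into a DataTable using a DataAdapter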

Function ReadExcelToDatatable(filepath As String, sourceTblName As String, dataTblName As String) As DataTable
    Dim table As New DataTable(dataTblName)
    ' Pick the Excel 12.0 variant from the file extension.
    Dim ext As String = ""
    If filepath.EndsWith("xlsx", StringComparison.OrdinalIgnoreCase) Then
        ext = "Xml"
    ElseIf filepath.EndsWith("xlsm", StringComparison.OrdinalIgnoreCase) Then
        ext = "Macro"
    End If
    Try
        ' Using blocks dispose the connection and adapter even if Fill throws.
        Using conn As New OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0; Data Source=" & filepath & ";Extended Properties = ""Excel 12.0 " & ext & "; HDR=YES; IMEX=1""")
            Using adapter As New OleDb.OleDbDataAdapter("SELECT * FROM [" & sourceTblName & "]", conn)
                adapter.Fill(table)
            End Using
        End Using
    Catch ex As Exception
        Console.WriteLine(ex)
        Console.WriteLine("Cannot read " & dataTblName & " to data table.")
    End Try
    Return table
End Function

***For each dictionary item, filter the data table using DataTable.Select(filter, sort) and apply the changes

Sub DoFilterTables(rawTable As DataTable, dictTable As DataTable)
    For Each dictRow As DataRow In dictTable.Rows
        Try
            ' Read the rule's target column and build the filter once per dictionary row.
            Dim thenColumn As String = CStr(dictRow("THEN COLUMN NAME"))
            Dim filter As String = CStr(dictRow("IF COLUMN NAME 1") & " LIKE '%" & dictRow("KEYWORD 1") & "%'")
            For Each selectedRow As DataRow In rawTable.Select(filter)
                If IsDBNull(selectedRow(thenColumn)) Then
                    selectedRow(thenColumn) = 1
                Else
                    selectedRow(thenColumn) = dictRow("ASSIGNS KEYWORD 3") + selectedRow(thenColumn)
                End If
                selectedRow.AcceptChanges()
            Next
        Catch ex As Exception
            Console.WriteLine(ex)
            Console.WriteLine("Failed to filter")
        End Try
    Next
End Sub
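
The filter step can be tried out in isolation on a tiny in-memory table. The following is only a minimal sketch: the column names (Description, Category), the sample rows, and the "Transport" label are made up for illustration and are not part of the real dictionary. It assumes the default DataTable.CaseSensitive = False, so the LIKE match ignores case, and it calls AcceptChanges once on the table instead of once per row.

```vb
Imports System
Imports System.Data

Module SelectLikeDemo
    Sub Main()
        ' Hypothetical sample data; the real table is loaded from Excel.
        Dim raw As New DataTable("raw")
        raw.Columns.Add("Description", GetType(String))
        raw.Columns.Add("Category", GetType(String))
        raw.Rows.Add("UBER TRIP 1234", DBNull.Value)
        raw.Rows.Add("Coffee shop", DBNull.Value)
        raw.Rows.Add("uber eats order", DBNull.Value)

        ' Same filter shape as in DoFilterTables: <column> LIKE '%<keyword>%'.
        ' A column name containing spaces would have to be wrapped in [brackets].
        Dim matches As DataRow() = raw.Select("Description LIKE '%uber%'")
        For Each r As DataRow In matches
            r("Category") = "Transport"
        Next

        ' Commit all pending changes in one call.
        raw.AcceptChanges()

        ' Print the number of rows that matched the filter.
        Console.WriteLine(matches.Length)
    End Sub
End Module
```

Each call to Select scans the whole table, so the cost of this step grows with the number of dictionary rows multiplied by the number of data rows.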

***Save it row by row as a text file

Sub DataTable2CSV(ByVal table As DataTable, ByVal filename As String, ByVal sepChar As String)
    ' Using flushes and closes the writer even if an exception is thrown.
    Using writer As New System.IO.StreamWriter(filename)
        ' Header line: each column name followed by the separator.
        For Each col As DataColumn In table.Columns
            writer.Write(col.ColumnName & sepChar)
        Next
        writer.Write(vbCrLf)
        For Each row As DataRow In table.Rows
            For Each col As DataColumn In table.Columns
                ' Write an empty field for DBNull instead of relying on a caught exception.
                If Not row.IsNull(col) Then writer.Write(CStr(row(col)))
                writer.Write(sepChar)
            Next
            writer.Write(vbCrLf)
        Next
    End Using
End Sub

End Module

Any input would be greatly appreciated. Thanks!

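Edit

It turns out that 95% of the time is spent reading the Excel worksheet into the data table using OleDB and the DataAdapter.

Is OleDB -> DataAdapter the most efficient approach?

Would CSV -> DataTable be faster?

How does Interop compare in terms of performance?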

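1 Answer:

Answer 0: (score: 0)

I happen to have a 5 GB CSV file on my computer with more than 4 million rows, so I wrote a routine that reads the first million rows and writes them back out without any processing. On my PC this takes 7 seconds.

To process the file line by line, you can use a snippet similar to the following: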

        Dim counter As Integer
        Dim lStart, lEnd As Long
        lStart = Environment.TickCount
        ' Stream the source file line by line and copy the first million lines.
        Using r = System.IO.File.AppendText("C:\...\test.csv")
            For Each line As String In System.IO.File.ReadLines("C:\...\source.csv")
                ' Any per-line processing would go here.
                r.WriteLine(line)
                counter += 1
                If counter = 1000000 Then
                    Exit For
                End If
            Next
        End Using
        lEnd = Environment.TickCount
        ' Elapsed time in milliseconds.
        MsgBox("done: " & (lEnd - lStart))
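
As a rough illustration of how the keyword classification could be folded into that loop, here is a minimal sketch. The Classify helper, the single hard-coded rule ("uber" -> "Transport"), and the temporary files are all hypothetical stand-ins for the real dictionary and data. It also assumes a simple comma-separated file with no quoted fields, which real exports may not satisfy.

```vb
Imports System
Imports System.IO

Module StreamClassifyDemo
    ' Hypothetical rule: if the line contains the keyword (ignoring case),
    ' append the category as an extra field; otherwise append an empty field.
    Function Classify(line As String, keyword As String, category As String) As String
        If line.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0 Then
            Return line & "," & category
        End If
        Return line & ","
    End Function

    Sub Main()
        ' Temporary files stand in for the real source and target CSV files.
        Dim source As String = Path.GetTempFileName()
        Dim target As String = Path.GetTempFileName()
        File.WriteAllLines(source, {"1,UBER TRIP", "2,Coffee shop"})

        ' File.ReadLines streams the file, so only one line is held in memory at a time.
        Using writer As New StreamWriter(target)
            For Each line As String In File.ReadLines(source)
                writer.WriteLine(Classify(line, "uber", "Transport"))
            Next
        End Using

        ' Show the classified output.
        Console.WriteLine(File.ReadAllText(target))
    End Sub
End Module
```

This avoids loading everything into a DataTable, at the cost of matching on the whole line rather than on a specific column; splitting each line into fields would be needed to restrict a rule to one column.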