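I am building a simple system to clean up some raw transaction data. It uses a self-built dictionary to scan certain columns for keywords and classify the rows.
The problem is that the program is slow. On a dataset of one million rows, it takes about 60 minutes to finish.
Is there any way to make it run faster? Here is the skeleton of my program (written in .NET):
***Read the source file (Excel .xlsx) over an OleDB connection and fill it into a DataTable with a DataAdapter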
Function ReadExcelToDatatable(filepath As String, sourceTblName As String, dataTblName As String) As DataTable
    ReadExcelToDatatable = New DataTable(dataTblName)
    Dim ext As String
    If Right(filepath, 4) = "xlsx" Then ext = "Xml" Else If Right(filepath, 4) = "xlsm" Then ext = "Macro" Else ext = ""
    Try
        Dim conn As New OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0; Data Source=" & filepath & ";Extended Properties = ""Excel 12.0 " & ext & "; HDR=YES; IMEX=1""")
        Dim adapter As New OleDb.OleDbDataAdapter("SELECT * FROM [" & sourceTblName & "]", conn)
        adapter.Fill(ReadExcelToDatatable)
        adapter.Dispose()
        conn.Dispose()
    Catch ex As Exception
        Console.WriteLine(ex)
        Console.WriteLine("Cannot read " & dataTblName & " to data table.")
    End Try
End Function
***For each dictionary item, filter the data table with DataTable.Select(filter, sort) and apply the changes
Sub DoFilterTables(rawTable As DataTable, dictTable As DataTable)
    For Each dictRow As DataRow In dictTable.Rows
        Try
            Dim rows As DataRow() = rawTable.Select(dictRow("IF COLUMN NAME 1") & " LIKE '%" & dictRow("KEYWORD 1") & "%'")
            For Each selectedRow As DataRow In rows
                If IsDBNull(selectedRow(CStr(dictRow("THEN COLUMN NAME")))) Then
                    selectedRow(CStr(dictRow("THEN COLUMN NAME"))) = 1
                Else
                    selectedRow(CStr(dictRow("THEN COLUMN NAME"))) = dictRow("ASSIGNS KEYWORD 3") + selectedRow(CStr(dictRow("THEN COLUMN NAME")))
                End If
                selectedRow.AcceptChanges()
            Next
        Catch ex As Exception
            Console.WriteLine(ex)
            Console.WriteLine("Failed to filter")
        End Try
    Next
End Sub
***Save it as a text file, line by line
Sub DataTable2CSV(ByVal table As DataTable, ByVal filename As String, ByVal sepChar As String)
    ' Using closes the writer even if an exception is thrown while writing.
    Using writer As New System.IO.StreamWriter(filename)
        ' Header row.
        Dim str As String = ""
        For Each col As DataColumn In table.Columns
            str = str & col.ColumnName & sepChar
        Next
        str = str & vbCrLf
        writer.Write(str)
        ' Data rows, one line per DataRow.
        Dim str2 As String = ""
        For Each row As DataRow In table.Rows
            str2 = ""
            For Each col As DataColumn In table.Columns
                Try
                    str2 = str2 & CStr(row(col.ColumnName)) & sepChar
                Catch ex As Exception
                    ' Values that cannot be converted (e.g. DBNull) are written as empty fields.
                    str2 = str2 & sepChar
                End Try
            Next
            str2 = str2 & vbCrLf
            writer.Write(str2)
        Next
    End Using
End Sub
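End Module
Any input would be appreciated. Thanks!
Edit
It turns out that 95% of the time is spent reading the Excel worksheet into the DataTable with OleDB and the DataAdapter.
Is OleDB -> DataAdapter the most efficient approach?
Would CSV -> DataTable be faster?
How does Interop compare in terms of performance?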
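Answer 0 (score: 0)
I happen to have a 5GB csv file on my computer with more than 4 million rows, so I wrote a routine that reads the first million rows and dumps them to another file without any processing. On my PC this takes 7 seconds.
To process the file line by line, you can use a snippet similar to the following: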
Dim counter As Integer
Dim lStart, lEnd As Long
lStart = Environment.TickCount
Using r = System.IO.File.AppendText("C:\...\test.csv")
    For Each line As String In System.IO.File.ReadLines("C:\...\source.csv")
        r.WriteLine(line)
        counter += 1
        If counter = 1000000 Then
            Exit For
        End If
    Next
End Using
lEnd = Environment.TickCount
MsgBox("done: " & (lEnd - lStart))
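The loop above only copies lines. As a rough sketch of how the keyword classification from the question might be folded into the same streaming pass, something like the following could work. The `Rule` structure, the `ClassifyCsv` name and the `LABEL` column are hypothetical and not part of the original code. It also assumes the source has been exported to a simple delimited file with a header row and no quoted fields containing the separator, and that a case-insensitive substring match is an acceptable stand-in for the `LIKE '%keyword%'` filter.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module StreamingClassifier
    ' Hypothetical rule: if the field at SourceIndex contains Keyword, append Label.
    Structure Rule
        Public SourceIndex As Integer
        Public Keyword As String
        Public Label As String
    End Structure

    ' Streams sourcePath to targetPath one line at a time and appends a label column.
    ' Assumption: plain delimited text, header in the first line, no quoted separators.
    Sub ClassifyCsv(sourcePath As String, targetPath As String, sepChar As Char, rules As List(Of Rule))
        Dim isHeader As Boolean = True
        Using writer As New StreamWriter(targetPath)
            For Each line As String In File.ReadLines(sourcePath)
                If isHeader Then
                    ' Pass the header through and name the new column.
                    writer.WriteLine(line & sepChar & "LABEL")
                    isHeader = False
                    Continue For
                End If
                Dim fields As String() = line.Split(sepChar)
                Dim assigned As String = ""
                For Each r As Rule In rules
                    ' Case-insensitive substring test against the chosen field.
                    If r.SourceIndex < fields.Length AndAlso
                       fields(r.SourceIndex).IndexOf(r.Keyword, StringComparison.OrdinalIgnoreCase) >= 0 Then
                        assigned &= r.Label
                    End If
                Next
                writer.WriteLine(line & sepChar & assigned)
            Next
        End Using
    End Sub
End Module
```

Each row is visited once and every rule is checked against it in memory, so the whole file never has to be loaded into a DataTable. If the data contains quoted fields, the `Split` call would need to be replaced with a proper CSV parser.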