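I have read several questions about speeding up the processing of large CSV data. I implemented some of the ideas and gained some improvement in processing time, but I still need to cut it further and hope someone can help.
I think my code is too long, so I will try to simplify it. Here is what my code does:
1. Read through the CSV file.
2. Group the data by the first column, compute the sum of each remaining column, and return the result.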
Example (raw data):
A B C
1 2 3
1 2 3
2 4 4
2 4 4
Result:
A B C
1 4 6
2 8 8
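Note: my actual data is a 100 MB file with 630 columns and 29,000 rows, about 18.27 million values in total.
Here is how I implemented it:
Method 1:
1. Read the CSV file through a FileStream.
2. Use Split to break up the returned string and process it line by line, field by field.
3. Store the results in an array and save them to a text file.
Note on Method 1: processing the data this way takes about 1 minute 20 seconds.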
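Method 2:
1. Read the CSV file through a FileStream.
2. Hand the data to different threads before processing starts. (Currently each thread gets 100 rows of data, and the number of threads is fixed at 5 because of CPU resource limits.)
3. Use Split to break up the returned string and process it line by line.
4. Join the results from all threads and store them in an array. Save the results to a text file.
Note on Method 2: processing the data this way takes about 50 seconds.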
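So I gained about 30 seconds by moving from Method 1 to Method 2. I would like to know what else I can do to improve the processing time. I tried cutting the data into smaller pieces, such as 100 rows × 100 columns, and processing those, but the processing time got longer.
I hope someone can help me.
Thanks in advance.
Edit:
Here is my code for Method 2 (I'll skip Method 1, since I no longer use it). One subroutine manages the thread assignment for every 100 rows read from the file stream, runs each thread, collects its result, and finally merges all the results into a single array before they are written to the file. I tried to keep the code as simple as possible. I hope this gives a better idea of how I process my data.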
'Subroutine that assigns smaller sections of the raw data to different threads
Sub process_control(ByVal filename As String)
    Dim sread As New FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.Read)
    Dim read As New StreamReader(sread)
    Dim t1 As System.Threading.Thread
    Dim value, data1(), data2(), data3(), data4(), data5(), threadid(), result1(0), result2(0), result3(0), result4(0), result5(0) As String
    Dim row As Integer
    Dim rowlimit As Integer = 99
    Dim check1 As Boolean = True
    row = 0
    check1 = False
    ReDim data1(rowlimit), data2(rowlimit), data3(rowlimit), data4(rowlimit), data5(rowlimit), threadid(4)
    Do
        value = read.ReadLine
        If row < rowlimit + 1 Then
            'Fill the first buffer that is not yet full
            If data1(rowlimit) = "" Then
                data1(row) = value
            ElseIf data2(rowlimit) = "" Then
                data2(row) = value
            ElseIf data3(rowlimit) = "" Then
                data3(row) = value
            ElseIf data4(rowlimit) = "" Then
                data4(row) = value
            ElseIf data5(rowlimit) = "" Then
                data5(row) = value
            End If
        Else
            'The current buffer is full: start a thread for it and move on to the next buffer
            If data1(rowlimit) <> "" And data2(rowlimit) = "" And data3(rowlimit) = "" And data4(rowlimit) = "" And data5(rowlimit) = "" Then
                threadid(0) = ""
                t1 = New Threading.Thread(Sub()
                                              result1 = process(data1).Clone
                                              threadid(0) = System.Threading.Thread.CurrentThread.ManagedThreadId
                                          End Sub)
                t1.Start()
                row = 0
                data2(row) = value
            ElseIf data1(rowlimit) <> "" And data2(rowlimit) <> "" And data3(rowlimit) = "" And data4(rowlimit) = "" And data5(rowlimit) = "" Then
                threadid(1) = ""
                t1 = New Threading.Thread(Sub()
                                              result2 = process(data2).Clone
                                              threadid(1) = System.Threading.Thread.CurrentThread.ManagedThreadId
                                          End Sub)
                t1.Start()
                row = 0
                data3(row) = value
            ElseIf data1(rowlimit) <> "" And data2(rowlimit) <> "" And data3(rowlimit) <> "" And data4(rowlimit) = "" And data5(rowlimit) = "" Then
                threadid(2) = ""
                t1 = New Threading.Thread(Sub()
                                              result3 = process(data3).Clone
                                              threadid(2) = System.Threading.Thread.CurrentThread.ManagedThreadId
                                          End Sub)
                t1.Start()
                row = 0
                data4(row) = value
            ElseIf data1(rowlimit) <> "" And data2(rowlimit) <> "" And data3(rowlimit) <> "" And data4(rowlimit) <> "" And data5(rowlimit) = "" Then
                threadid(3) = ""
                t1 = New Threading.Thread(Sub()
                                              result4 = process(data4).Clone
                                              threadid(3) = System.Threading.Thread.CurrentThread.ManagedThreadId
                                          End Sub)
                t1.Start()
                row = 0
                data5(row) = value
            ElseIf data1(rowlimit) <> "" And data2(rowlimit) <> "" And data3(rowlimit) <> "" And data4(rowlimit) <> "" And data5(rowlimit) <> "" Then
                threadid(4) = ""
                t1 = New Threading.Thread(Sub()
                                              result5 = process(data5).Clone
                                              threadid(4) = System.Threading.Thread.CurrentThread.ManagedThreadId
                                          End Sub)
                t1.Start()
                row = 0
                check1 = True
            End If
        End If
        'Advance to the next slot of the current buffer
        row += 1
        If check1 = True Then
            'Wait until all five threads have reported their thread id
            Do
                System.Threading.Thread.Sleep(100)
            Loop Until threadid(0) <> "" And threadid(1) <> "" And threadid(2) <> "" And threadid(3) <> "" And threadid(4) <> ""
            row = 0
            ReDim data1(rowlimit)
            data1(row) = value
            row += 1
            result1_update(result1) ' consolidate result into a single array
            result2_update(result2) ' consolidate result into a single array
            result3_update(result3) ' consolidate result into a single array
            result4_update(result4) ' consolidate result into a single array
            result5_update(result5) ' consolidate result into a single array
            check1 = False
            ReDim data2(rowlimit), data3(rowlimit), data4(rowlimit), data5(rowlimit)
        End If
    Loop Until read.EndOfStream
End Sub
' Function that calculates the per-column sums for each key in the first column
Function process(ByVal data() As String) As String()
    Dim line(), line1(), result() As String
    Dim check As Boolean
    ReDim result(0)
    For n = 0 To (data.Count - 1)
        If result(0) = "" And result.Count = 1 Then
            result(result.Count - 1) = data(n)
        Else
            check = True
            line1 = Split(data(n), ",", -1, CompareMethod.Text)
            For m = 0 To (result.Count - 1)
                line = Split(result(m), ",", -1, CompareMethod.Text)
                If line1(0) = line(0) Then
                    check = False
                    For o = 1 To (line1.Count - 1)
                        line(o) = Val(line1(o)) + Val(line(o))
                    Next o
                    result(m) = Join(line, ",")
                    Exit For
                End If
            Next m
            If check = True Then
                ReDim Preserve result(result.Count)
                result(result.Count - 1) = Join(line1, ",")
            End If
        End If
    Next n
    ReDim Preserve result(result.Count - 2)
    process = result.Clone
End Function
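Answer 0 (score: 0)
Looking at your code, I noticed a few things:
You are using Val, which is very easy to use but carries a high overhead. Integer.Parse will work more efficiently.
You are converting from string to number and back to string far more often than you need to. Since your summary is only a small fraction of the full data size, holding the results in memory should not be a problem. A Dictionary(Of Integer, Integer()) works well for this.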
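Swapping Val for a strict parser also changes how malformed fields are handled, so it may be worth checking your data first. The sketch below is only an illustration: the module name ParseDemo and the helper ParseField are hypothetical, and falling back to 0 for an invalid field is an assumption about what you would want, not something taken from your code.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module ParseDemo
    ' Hypothetical helper: parse a field strictly and fall back to 0
    ' (an assumed policy) when the field is not a valid integer.
    Function ParseField(field As String) As Integer
        Dim n As Integer
        If Integer.TryParse(field, n) Then Return n
        Return 0
    End Function

    Sub Main()
        ' Val reads as many leading numeric characters as it can and returns a Double.
        Console.WriteLine(Val("12abc"))
        ' Integer.TryParse rejects the whole field if any part is invalid.
        Console.WriteLine(ParseField("12abc"))
        ' Leading and trailing whitespace is accepted by Integer.TryParse.
        Console.WriteLine(ParseField(" 42 "))
    End Sub
End Module
```

Integer.Parse behaves like Integer.TryParse here except that it throws a FormatException on an invalid field, which is fine when the file is known to be clean.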
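Consider this code, which reads the data, summarizes it, and puts it into a format that is easy to write to a file, all in under 10 seconds on test data made of random integers up to 1000: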
Function SummarizeData(filename As String, delimiter As Char) As Dictionary(Of Integer, Integer())
    Dim limit As Integer = 0
    SummarizeData = New Dictionary(Of Integer, Integer())
    Using sr As New IO.StreamReader(filename)
        'Since we don't need the first line for the summary, we can read it to get
        'the upper bound for the arrays and then discard the line.
        If Not sr.EndOfStream Then
            limit = sr.ReadLine.Split(delimiter).Length - 1
        Else
            Throw New Exception("Empty File")
        End If
        Do Until sr.EndOfStream
            'This creates an array of integers representing the data in one line.
            Dim line = sr.ReadLine.Split(delimiter).Select(Function(x) Integer.Parse(x)).ToArray
            'If the key is already in the dictionary we increment the values
            If SummarizeData.ContainsKey(line(0)) Then
                For I = 1 To limit
                    SummarizeData.Item(line(0))(I) += line(I)
                Next
            Else
                'If not, the line itself becomes the initial values for the new key
                SummarizeData.Add(line(0), line)
            End If
        Loop
    End Using
End Function
To use the function and write out the data, this will work:
Dim results = SummarizeData("data.txt", ","c)
'If you don't need the results sorted you can gain a few fractions of a second by
'removing the Order By clause
IO.File.WriteAllLines("results.txt", (From kvp In results
                                      Order By kvp.Key
                                      Select String.Join(",", kvp.Value)).ToArray)
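Note that the function above discards the header line, while the expected result in the question keeps it. If you need the header in the output, a variant along the following lines might do. This is a rough sketch rather than a tested replacement: SummaryDemo and SummarizeLines are hypothetical names, it works on lines already in memory so it can be tried on the small sample from the question, it assumes every row has the same number of integer fields, and it sums into Long as a precaution against overflow.

```vbnet
Imports System
Imports System.Collections.Generic

Module SummaryDemo
    ' Hypothetical helper: groups rows by the first column, sums the remaining
    ' columns, and returns the output lines with the header row first.
    ' Assumes every data row has the same number of integer fields.
    Function SummarizeLines(lines As IEnumerable(Of String), delimiter As Char) As List(Of String)
        Dim output As New List(Of String)
        ' SortedDictionary keeps the keys in ascending order for the output.
        Dim sums As New SortedDictionary(Of Integer, Long())
        Dim sep As String = delimiter.ToString()
        Dim isHeader As Boolean = True
        For Each rawLine As String In lines
            If isHeader Then
                output.Add(rawLine) ' keep the header row unchanged
                isHeader = False
                Continue For
            End If
            If rawLine.Trim().Length = 0 Then Continue For ' skip blank lines
            Dim fields As String() = rawLine.Split(delimiter)
            Dim groupKey As Integer = Integer.Parse(fields(0))
            Dim totals As Long() = Nothing
            If Not sums.TryGetValue(groupKey, totals) Then
                totals = New Long(fields.Length - 2) {} ' one slot per non-key column
                sums.Add(groupKey, totals)
            End If
            For i As Integer = 1 To fields.Length - 1
                totals(i - 1) += Long.Parse(fields(i))
            Next
        Next
        For Each kvp As KeyValuePair(Of Integer, Long()) In sums
            output.Add(kvp.Key.ToString() & sep & String.Join(sep, kvp.Value))
        Next
        Return output
    End Function

    Sub Main()
        ' The sample from the question, written with commas as the delimiter.
        Dim sample As String() = {"A,B,C", "1,2,3", "1,2,3", "2,4,4", "2,4,4"}
        For Each outLine As String In SummarizeLines(sample, ","c)
            Console.WriteLine(outLine)
        Next
    End Sub
End Module
```

For a real file you could pass IO.File.ReadLines(filename) as the first argument, which streams the lines instead of loading the whole file at once.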