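I have an application that reads a 5 GB text file line by line and converts comma-separated, double-quoted strings to a pipe-delimited format, i.e. "Smith, John","Snow, John" -> Smith, John|Snow, John
My code is below. My question is: is there a more efficient way to process a large file?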
' Requires Imports System.Text.RegularExpressions at the top of the file.
Dim fName As String = "C:\LargeFile.csv"
Dim wrtFile As String = "C:\ProcessedFile.txt"
Dim strRead As New System.IO.StreamReader(fName)
Dim strWrite As New System.IO.StreamWriter(wrtFile)
Dim line As String = ""
Do While strRead.Peek <> -1
    line = strRead.ReadLine
    ' Match only the commas that sit outside double-quoted fields.
    Dim pattern As String = "(,)(?=(?:[^""]|""[^""]*"")*$)"
    Dim replacement As String = "|"
    Dim regEx As New Regex(pattern)
    Dim newLine As String = regEx.Replace(line, replacement)
    ' Strip the remaining double quotes.
    newLine = newLine.Replace(Chr(34), "")
    strWrite.WriteLine(newLine)
Loop
strRead.Close()
strWrite.Close()
Updated code
Dim fName As String = "C:\LargeFile.csv"
Dim wrtFile As String = "C:\ProcessedFile.txt"
Dim strRead As New System.IO.StreamReader(fName)
Dim strWrite As New System.IO.StreamWriter(wrtFile)
Dim line As String = ""
Do While strRead.Peek <> -1
    line = strRead.ReadLine
    ' Replace each quote-comma-quote separator with a pipe.
    line = line.Replace(Chr(34) & Chr(44) & Chr(34), "|")
    ' Strip the remaining double quotes.
    line = line.Replace(Chr(34), "")
    strWrite.WriteLine(line)
Loop
strRead.Close()
strWrite.Close()
Answer 0 (score: 1)
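I tested your code and tried to speed it up by accumulating the output lines in a StringBuilder. I also moved the regex declaration outside the loop.
When that did not help, I checked CPU usage and disk I/O with Windows Process Monitor, which showed that the bottleneck was the CPU (even with an HDD rather than an SSD).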
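For reference, a minimal sketch of what that first attempt might look like. The module name, the `ConvertFile` helper and the `FlushChars` threshold are hypothetical and not taken from the answer; the periodic flush is my own assumption, added so the buffer never has to hold a 5 GB file.

```vbnet
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module HoistedRegexSketch
    ' Built once, outside the loop. Same pattern as in the question.
    Private ReadOnly CommaOutsideQuotes As New Regex("(,)(?=(?:[^""]|""[^""]*"")*$)")

    ' Hypothetical flush threshold (in characters); not a value from the answer.
    Private Const FlushChars As Integer = 1000000

    Sub ConvertFile(inputPath As String, outputPath As String)
        Dim buffer As New StringBuilder()
        Using reader As New StreamReader(inputPath), writer As New StreamWriter(outputPath)
            Dim line As String = reader.ReadLine()
            Do While line IsNot Nothing
                ' Pipe-delimit, strip quotes, and queue the line for writing.
                buffer.AppendLine(CommaOutsideQuotes.Replace(line, "|").Replace("""", ""))
                If buffer.Length >= FlushChars Then
                    writer.Write(buffer.ToString())
                    buffer.Clear()
                End If
                line = reader.ReadLine()
            Loop
            ' Write whatever is left in the buffer.
            writer.Write(buffer.ToString())
        End Using
    End Sub

    Sub Main()
        ConvertFile("C:\LargeFile.csv", "C:\ProcessedFile.txt")
    End Sub
End Module
```

Since StreamWriter already buffers its output, it is not surprising that this kind of change leaves a CPU-bound loop about as slow as before.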
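That prompted me to try an alternative way of modifying the text: if all you need to do is replace "," with | and remove any remaining double quotes, then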
newLine = line.Replace(""",""", "|").Replace("""", "")
is much faster than using a regex (about four times faster in my tests).
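A minimal sketch of that replacement placed in a complete read/write loop. The module and the `ToPipeDelimited` helper are hypothetical names; the file paths are the ones from the question. Note the assumption it relies on: every field is wrapped in double quotes, so a bare comma between unquoted fields would be left unchanged.

```vbnet
Imports System
Imports System.IO

Module ReplaceSketch
    ' Turns "a","b" into a|b using plain string replacement, no regex.
    Function ToPipeDelimited(line As String) As String
        Return line.Replace(""",""", "|").Replace("""", "")
    End Function

    Sub Main()
        ' Sample line from the question.
        Console.WriteLine(ToPipeDelimited("""Smith, John"",""Snow, John"""))
        ' prints: Smith, John|Snow, John

        ' Using blocks close both files even if an exception is thrown.
        Using reader As New StreamReader("C:\LargeFile.csv"), writer As New StreamWriter("C:\ProcessedFile.txt")
            Dim line As String = reader.ReadLine()
            Do While line IsNot Nothing
                writer.WriteLine(ToPipeDelimited(line))
                line = reader.ReadLine()
            Loop
        End Using
    End Sub
End Module
```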
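(As @Werdna suggested, multithreading could improve this further, provided that more than one processor is available and you can coordinate writing the modified data back in the correct order.)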
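One way to get ordered, multithreaded processing without hand-written coordination is PLINQ with `AsOrdered()`. This is a sketch of that idea, not something the answer tested; the module name is hypothetical. Because the work per line is small, it is an open question whether the threading overhead pays off here, so it would need measuring against the single-threaded loop.

```vbnet
Imports System.IO
Imports System.Linq

Module ParallelSketch
    Sub Main()
        ' File.ReadLines streams the input lazily instead of loading all 5 GB.
        ' AsOrdered() makes PLINQ yield results in the original line order.
        Dim converted = File.ReadLines("C:\LargeFile.csv").
            AsParallel().
            AsOrdered().
            Select(Function(line) line.Replace(""",""", "|").Replace("""", ""))

        ' WriteAllLines enumerates the query and writes one line at a time.
        File.WriteAllLines("C:\ProcessedFile.txt", converted)
    End Sub
End Module
```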