Efficiently appending .csv files together (VB.NET)

Time: 2019-07-11 06:44:38

Tags: vb.net csv append

I have a process in Visual Basic that I would like to make more efficient. Here is what I am trying to do:

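  • I have a folder containing 100 .csv files (comma-separated), each with about 5,000 rows and about 200 columns. The column order may differ from one file to another, and some columns are missing from some files.
  • My goal is to create one big .csv file that combines all 100 .csv files and keeps only a selection of columns that I specify in advance.
  • Here is how I go about it: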

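    1. Create an array that stores the names of the columns wanted in the final "big .csv".
    2. Go through all the files in the folder. For each file:
    3. For each line in the file, use the Split function to create an array holding all the values of that line.
    4. Create a mapping array that stores, for each column name selected in step 1, the position of that column in the file (done only for the first line of each file).
    5. Write the header to the file ("big .csv") (done only once).
    6. In the same big file, for each line of each file, write the data according to the column positions.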
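The mapping in steps 4 to 6 can be sketched as follows. This is only an illustration under stated assumptions: the names `BuildColumnMap` and `ProjectLine` are hypothetical, fields are assumed to contain no quoted commas (so a plain `Split` is enough), and an empty string stands in for a column that a file does not have.

```vbnet
Imports System
Imports System.Collections.Generic

Module ColumnMapSketch
    ' Hypothetical helper: for each wanted column, returns its zero-based
    ' position in headerLine, or -1 when the file does not contain it.
    Function BuildColumnMap(wanted As String(), headerLine As String) As Integer()
        Dim positions As New Dictionary(Of String, Integer)(StringComparer.OrdinalIgnoreCase)
        Dim names As String() = headerLine.Split(","c)
        For i As Integer = 0 To names.Length - 1
            positions(names(i).Trim()) = i
        Next

        Dim map(wanted.Length - 1) As Integer
        For i As Integer = 0 To wanted.Length - 1
            Dim pos As Integer
            map(i) = If(positions.TryGetValue(wanted(i), pos), pos, -1)
        Next
        Return map
    End Function

    ' Hypothetical helper: builds one output row by picking the mapped fields
    ' from a data line. Assumes no quoted commas inside fields.
    Function ProjectLine(line As String, map As Integer(), missing As String) As String
        Dim fields As String() = line.Split(","c)
        Dim picked(map.Length - 1) As String
        For i As Integer = 0 To map.Length - 1
            picked(i) = If(map(i) >= 0 AndAlso map(i) < fields.Length, fields(map(i)), missing)
        Next
        ' One Join per row instead of repeated string concatenation per field.
        Return String.Join(",", picked)
    End Function

    Sub Main()
        Dim wanted As String() = {"id", "price", "qty"}
        Dim map As Integer() = BuildColumnMap(wanted, "qty,id,name")
        Console.WriteLine(ProjectLine("5,42,widget", map, ""))   ' prints 42,,5
    End Sub
End Module
```

The map is built once per file from its header line, so each data line costs one `Split` and one `String.Join` regardless of how many columns are selected.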

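So the process works and I get the result I want, but it is very slow. On my computer it takes about 200 minutes to process about 200 files which, once appended, contain about 500,000 rows and 200 columns. A colleague built a similar process using the data.table package in R to append all the files, and he was able to perform the same append on the same .csv tables, on the same computer, in 5-10 minutes.

I am wondering whether there is a better option than going through the files "cell by cell". Can I identify the unwanted columns in a source file and drop them entirely? Is there a function that appends files together rather than reading each cell and writing it back?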

Thanks in advance for your help!

Edit: Alternatively, is there another programming language (Python? PowerShell?) that handles this kind of file more efficiently?

Edit2: More details on why I think it runs slowly.

Edit3: A piece of the code in question, as requested in the comments:

'We loop through each file in the folder location
            For Each file As String In files

                Dim objReader As New System.IO.StreamReader(file)
                Dim fileInfo As New IO.FileInfo(file)

                'We only loop through the valid .rpt files
                If CheckValidRPTFile(file, True) = True Then

                    'Count the number of files we go through
                    temp_count = temp_count + 1

                    'Count the number of lines in the file (called FileSize)
                    Dim objReaderLineCOunt As New System.IO.StreamReader(file)
                    FileSize = 0
                    Do While objReaderLineCOunt.Peek() <> -1
                        TextLine = objReaderLineCOunt.ReadLine()
                        FileSize = FileSize + 1
                    Loop

                    temp_line = 1
                    'We loop through line by line for a given file
                    Do While objReader.Peek() <> -1

                        'We split into an array using the comma delimiter
                        TextLine = objReader.ReadLine()
                        TextLineSplit = TextLine.Split(","c)

                        'Only process lines that start with "*"; all other lines are skipped
                        If Strings.Left(TextLine, 1) = "*" Then

                            'We loop through the number of field we wish to extract for the file
                            For temp_field = 1 To HeaderNbOfField

                                If temp_field = 1 Then
                                    BodyString = "('" & Strings.Left(fileInfo.Name, Len(fileInfo.Name) - 4) & "',"
                                End If

                                'The array HeaderMapId tells us where to pick the information from the file
                                'This assumes that each file in the folder has the same header as the 'HeaderFile'
                                If HeaderMapId(Is_MPF_Type, temp_field) = Is_Not_found Then
                                    BodyString = BodyString & "98766789"
                                Else
                                    BodyString = BodyString & TextLineSplit(HeaderMapId(Is_MPF_Type, temp_field) - 1)
                                End If


                                If temp_field <> HeaderNbOfField Then
                                    BodyString = BodyString & ","
                                Else
                                    BodyString = BodyString & ")"
                                End If
                            Next

                            'We replace double quotes with single quotes
                            BodyString = Replace(BodyString, """", "'")

                            'This Line is to add records to the .csv file
                            If Enable_CSV_Output = "Yes" Then
                                'Remove parentheses and single quotes
                                outFile.WriteLine(Replace(Replace(Replace(BodyString, ")", ""), "(", ""), "'", ""))
                            End If

                        End If

                        temp_line = temp_line + 1
                    Loop

                    temp_file = temp_file + 1
                End If

            Next

            outFile.Close()

Updated code based on RobertBaron's comment:

        'We loop through each file in the folder location
        For Each file As String In files

            Dim objReader As New System.IO.StreamReader(file)
            Dim fileInfo As New IO.FileInfo(file)

            'We only loop through the valid .rpt files
            If CheckValidRPTFile(file, True) = True Then

                'Count the number of files we go through
                temp_count = temp_count + 1

                'Count the number of lines in the file (called FileSize)
                'Dim objReaderLineCOunt As New System.IO.StreamReader(file)
                'FileSize = 0
                'Do While objReaderLineCOunt.Peek() <> -1
                'TextLine = objReaderLineCOunt.ReadLine()
                'FileSize = FileSize + 1
                'Loop

                'temp_line = 1
                ''We loop through line by line for a given file
                'Do While objReader.Peek() <> -1

                Dim TextLines() As String = System.IO.File.ReadAllLines(file)
                For Each TextLine2 In TextLines

                    'Update the Progress Bar
                    temp_full_count = temp_full_count + 1

                    'We split the line into an array using the comma delimiter
                    'TextLine = objReader.ReadLine()
                    TextLineSplit = TextLine2.Split(","c)

                    'Skip lines that are not actual prophet records (the header and the first few lines)
                    If Strings.Left(TextLine2, 1) = "*" Then

                        'We loop through the number of field we wish to extract for the file
                        For temp_field = 1 To HeaderNbOfField

                            If temp_field = 1 Then
                                BodyString = "('" & Strings.Left(fileInfo.Name, Len(fileInfo.Name) - 4) & "',"
                            End If

                            'The array HeaderMapId tells us where to pick the information from the file
                            'This assumes that each file in the folder has the same header as the 'HeaderFile'
                            If HeaderMapId(Is_MPF_Type, temp_field) = Is_Not_found Then
                                BodyString = BodyString & "98766789"
                            Else
                                BodyString = BodyString & TextLineSplit(HeaderMapId(Is_MPF_Type, temp_field) - 1)
                            End If


                            If temp_field <> HeaderNbOfField Then
                                BodyString = BodyString & ","
                            Else
                                BodyString = BodyString & ")"
                            End If
                        Next

                        'We replace double quotes with single quotes
                        BodyString = Replace(BodyString, """", "'")

                        'This Line is to add records to the .csv file
                        If Enable_CSV_Output = "Yes" Then
                            'Remove parentheses and single quotes
                            outFile.WriteLine(Replace(Replace(Replace(BodyString, ")", ""), "(", ""), "'", ""))
                        End If

                    End If

                    temp_line = temp_line + 1

                Next

                TB_Runlog.Text = "Completed: " & fileInfo.Name & Environment.NewLine & TB_Runlog.Text
                temp_file = temp_file + 1
            End If

        Next

        outFile.Close()

1 Answer:

Answer 0: (score: 0)

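One way to speed up the program is to reduce the number of disk accesses. Right now you are reading each file line by line twice: once to count the lines and once to process them. Each file most likely fits in memory, so you can read all the lines of a file into memory at once and then process them from there. That will be faster.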
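As a self-contained illustration of that idea, here is a minimal sketch that reads each file once with `File.ReadAllLines` and takes the line count from the array length instead of a separate counting pass. The names `AppendRecords` and `SinglePassSketch`, the `*.csv` pattern, and the command-line arguments are hypothetical; the `"*"` prefix test mirrors the filter in the question's code.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module SinglePassSketch
    ' Reads each input file once with File.ReadAllLines and appends the lines
    ' that start with "*" to one output file. Returns the total number of
    ' lines read, taken from the array length rather than a second read.
    Function AppendRecords(inputFiles As IEnumerable(Of String), outputPath As String) As Integer
        Dim totalLines As Integer = 0
        Using outFile As New StreamWriter(outputPath)
            For Each inputPath As String In inputFiles
                Dim textLines As String() = File.ReadAllLines(inputPath)
                totalLines += textLines.Length   ' line count without a counting loop
                For Each textLine As String In textLines
                    If textLine.StartsWith("*", StringComparison.Ordinal) Then
                        outFile.WriteLine(textLine)
                    End If
                Next
            Next
        End Using   ' closes and flushes the writer even if an exception occurs
        Return totalLines
    End Function

    Sub Main(args As String())
        ' Hypothetical usage: args(0) is the input folder, args(1) the output file.
        ' The output file is assumed to live outside the input folder.
        Dim files As String() = Directory.GetFiles(args(0), "*.csv")
        Console.WriteLine(AppendRecords(files, args(1)))
    End Sub
End Module
```

If a file were too large to hold in memory, `File.ReadLines` could be swapped in to stream the lines lazily, at the cost of not knowing the line count up front.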

Something like this:

'We only loop through the valid .rpt files
If CheckValidRPTFile(file, True) = True Then

    ''Count the number of files we go through
    'temp_count = temp_count + 1

    ''Count the number of lines in the file (called FileSize)
    'Dim objReaderLineCOunt As New System.IO.StreamReader(file)
    'FileSize = 0
    'Do While objReaderLineCOunt.Peek() <> -1
    '    TextLine = objReaderLineCOunt.ReadLine()
    '    FileSize = FileSize + 1
    'Loop

    'temp_line = 1
    ''We loop through line by line for a given file
    'Do While objReader.Peek() <> -1

    Dim TextLines() As String = System.IO.File.ReadAllLines(file)
    For Each TextLine In TextLines

        'We split into an array using the comma delimiter
        'TextLine = objReader.ReadLine()
        TextLineSplit = TextLine.Split(","c)