Parsing large CSV files

Time: 2017-11-07 10:35:23

Tags: csv vbscript

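I am struggling to find a robust and lightweight algorithm to process some large CSV files. Here is a minimal example of what they look like: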

Time    a     b     c
0      2.9   1.6   4.1
0      3.6   1.1   0.5
0      3.4   0.2   1.7
1.2    0.1   4.2   1.9
1.201  2.3   3.1   4.8
9.99   0.2   0.8   1.2
10     3.1   3.3   2.3
10     3.6   3.5   3.0
10.01  1.1   4.5   3.9
10.01  2.2   3.0   2.3
17     4.3   2.3   3.8
20     1.0   3.2   3.0
30     4.1   3.0   4.9
40     3.8   3.3   1.6

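I need to post-process the CSV according to these rules:

  1. Only rows whose time is a multiple of 10 need to be considered.
  2. If several rows share the same timestamp, take the per-column average over those rows.

This is the output I want: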

    Time    a     b     c
    0      3.3   0.97  2.1
    10     2.04  3.02  2.54
    20     1.0   3.2   3.0
    30     4.1   3.0   4.9
    40     3.8   3.3   1.6
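
Note that the 10 row of the desired output (2.04, 3.02, 2.54) only comes out if the rows at 9.99 and 10.01 are pooled with the two rows at 10, so rule 1 effectively means "within a small tolerance of a multiple of 10". A minimal sketch of that test follows. The helper name `IsSampleTime` is hypothetical, and the tolerance of 0.1 is an assumption taken from the `tolerance = 1e-1` setting in the script.

```vbscript
Option Explicit

' Hypothetical helper: True when t is within tol of a multiple of stepSize.
Function IsSampleTime(t, stepSize, tol)
    Dim nearest
    ' Round() finds the nearest multiple, so 9.99 and 10.01 both map to 10.
    nearest = stepSize * Round(t / stepSize)
    IsSampleTime = (Abs(t - nearest) < tol)
End Function

' Pooling the five rows around t = 10 reproduces the desired value for column a.
Dim meanA
meanA = (0.2 + 3.1 + 3.6 + 1.1 + 2.2) / 5   ' approximately 2.04
```

With these settings, 9.99 and 10.01 pass the test while 1.2 and 17 do not.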
    

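Now the constraint: my script needs to process fairly large CSVs (up to a few hundred MB) on Windows machines that do not have much free memory. So I am not keen on storing the whole CSV in large dictionaries; I would rather process it line by line.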

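Here is my first naive attempt. It is pretty poor and does not work properly. (Small note: the average is not a true average but a weird kind of "average". Bear with me; I am trying to evaluate the workflow and do not really care about the numbers at this stage.)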

    filename = "test"
    sampling_time = 10.0
    tolerance = 1e-1
    
    Dim FSO, input, output
    Const ForReading = 1
    Const ForWriting = 2
    
    'Create the objects
    Set FSO = CreateObject("Scripting.FileSystemObject")
    Set input = FSO.OpenTextFile(filename & ".csv", ForReading, False)
    Set output = FSO.OpenTextFile(filename & "_output.csv", ForWriting, True)
    
    'First line: write headers
    s = input.ReadLine()
    output.WriteLine s
    
    'Second line: initialize sSplit_old
    s = input.ReadLine()
    sSplit = Split(s, ",")
    sSplit_old = sSplit
    
    'Keep reading...
    Do Until input.AtEndOfStream
        'read new line and split it into its components
        'this is needed to read the first element of the line, i.e. the time
        s = input.ReadLine()
        sSplit = Split(s, ",")
    
        'If the remainder of time/sampling_time is below the tolerance then the
        'line has to be processed.
        'Here the "\" operator (i.e. the integer division: 5\2=2, while 5/2=2.5)
        'is used because the "Mod" operator returns integer remainders only.
        If CDbl(sSplit(0))-sampling_time*(CDbl(sSplit(0))\sampling_time) < tolerance Then
            'If the current time is close to the previous one (within a tolerance)...
            If Abs(CDbl(sSplit(0))-CDbl(sSplit_old(0))) < tolerance Then
                '... cycle through the arrays and store the average
                For i = 0 To UBound(sSplit)
                    sSplit_old(i) = (CDbl(sSplit(i)) + CDbl(sSplit_old(i))) / 2.0
                Next
            Else
                '... otherwise just write the previous time and save the current
                'one to compare it to the next one
                s = Join(sSplit_old, ",")
                output.WriteLine s
                sSplit_old = sSplit
            End If
        End If
    Loop
    
    output.WriteLine s
    input.Close
    output.Close
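
A few things in the attempt above stand out on a read-through: the first data row is accepted without the multiple-of-10 check, pairwise averaging is not a mean once a group has more than two rows, and the final `output.WriteLine s` writes whatever `s` last held rather than the group still pending in `sSplit_old`. One possible way to keep the line-by-line approach is to hold only a running sum and a row count per group. The sketch below is not a definitive fix. It assumes comma-delimited input sorted by time, the same number of columns on every row, and dot decimals; `AggregateCsv` and `FormatGroup` are hypothetical names.

```vbscript
Option Explicit

Const ForReading = 1, ForWriting = 2

' Hypothetical helper: format one output row from the running sums.
Function FormatGroup(bucket, sums, count)
    Dim k, parts()
    ReDim parts(UBound(sums))
    parts(0) = CStr(bucket)
    For k = 1 To UBound(sums)
        parts(k) = CStr(sums(k) / count)
    Next
    FormatGroup = Join(parts, ",")
End Function

' Hypothetical routine: stream inPath to outPath, keeping one group in memory.
Sub AggregateCsv(inPath, outPath, stepSize, tol)
    Dim fso, inFile, outFile, line, fields, t, bucket, curBucket, sums(), count, i
    ' Assumption: the data uses "." as decimal separator, so force a locale
    ' where CDbl and CStr agree with that.
    SetLocale "en-us"
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set inFile = fso.OpenTextFile(inPath, ForReading, False)
    Set outFile = fso.OpenTextFile(outPath, ForWriting, True)
    ' Copy the header row unchanged.
    If Not inFile.AtEndOfStream Then outFile.WriteLine inFile.ReadLine()
    count = 0
    Do Until inFile.AtEndOfStream
        line = Trim(inFile.ReadLine())
        If Len(line) > 0 Then
            fields = Split(line, ",")
            t = CDbl(fields(0))
            ' Nearest multiple of stepSize; rows near it share one group.
            bucket = stepSize * Round(t / stepSize)
            If Abs(t - bucket) < tol Then
                If count > 0 Then
                    If bucket <> curBucket Then
                        ' A new group starts: write the finished one first.
                        outFile.WriteLine FormatGroup(curBucket, sums, count)
                        count = 0
                    End If
                End If
                If count = 0 Then
                    curBucket = bucket
                    ReDim sums(UBound(fields))
                    For i = 0 To UBound(sums)
                        sums(i) = 0
                    Next
                End If
                For i = 1 To UBound(fields)
                    sums(i) = sums(i) + CDbl(fields(i))
                Next
                count = count + 1
            End If
        End If
    Loop
    ' Write the group that is still pending at end of file.
    If count > 0 Then outFile.WriteLine FormatGroup(curBucket, sums, count)
    inFile.Close
    outFile.Close
End Sub
```

It could be called as `AggregateCsv "test.csv", "test_output.csv", 10.0, 0.1`. Memory use stays at one row of sums regardless of file size, at the cost of requiring the input to be sorted by time.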
    

1 answer:

Answer 0: (score: 1)

When you paid (too much) for your Windows OS, you also paid for an SQL engine. So use it:

Option Explicit

Dim db : Set db = CreateObject("ADODB.Connection")
Dim dd : dd = "E:\work\proj\soa\47155733\data"
Dim cs
If "AMD64" = CreateObject("WScript.Shell").ExpandEnvironmentStrings("%PROCESSOR_ARCHITECTURE%") Then
   cs = "Driver=Microsoft Access Text Driver (*.txt, *.csv);Dbq=" & dd & ";Extensions=asc,csv,tab,txt;"
   WScript.Echo "64 Bit:", cs
Else
   cs = "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=" & dd & ";Extensions=asc,csv,tab,txt;"
   WScript.Echo "32 Bit:", cs
End If
db.Open cs

Dim ss : ss = "SELECT * FROM [47155733.txt]"
WScript.Echo ss
WScript.Echo db.Execute(ss).GetString(2,,vbTab,vbCrlf,"*")

ss =   "SELECT t, avg(a), avg(b), avg(c) FROM [47155733.txt]" _
     & " WHERE t = Int(t) And 0.0 = t Mod 10 GROUP BY t"
WScript.Echo ss
WScript.Echo db.Execute(ss).GetString(2,,vbTab,vbCrlf,"*")

ss = "SELECT Round(1/3, 3)"
WScript.Echo ss
WScript.Echo db.Execute(ss).GetString(2,,vbTab,vbCrlf,"*")

Output:

cscript 47155733.vbs
SELECT * FROM [47155733.txt]
0       2,9     1,6     4,1
0       3,6     1,1     0,5
0       3,4     0,2     1,7
1,2     0,1     4,2     1,9
1,201   2,3     3,1     4,8
9,99    0,2     0,8     1,2
10      3,1     3,3     2,3
10      3,6     3,5     3
10,01   1,1     4,5     3,9
10,01   2,2     3       2,3
17      4,3     2,3     3,8
20      1       3,2     3
30      4,1     3       4,9
40      3,8     3,3     1,6

SELECT t, avg(a), avg(b), avg(c) FROM [47155733.txt] WHERE t = Int(t) And 0.0 = t Mod 10 GROUP BY t
0       3,3     0,966666666666667       2,1
10      3,35    3,4     2,65
20      1       3,2     3
30      4,1     3       4,9
40      3,8     3,3     1,6

SELECT Round(1/3, 3)
0,333
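
The second query keeps exact multiples of 10 only, which is why its 10 row (3.35, 3.4, 2.65) differs from the 2.04, 3.02, 2.54 asked for in the question. If rows within a tolerance should be pooled instead, a variant along these lines may work. It is an untested sketch: it assumes the text driver accepts `Abs` and grouping by an expression (`Round` is already exercised above), the alias `bucket` is made up for illustration, and 0.1 is the tolerance from the question's script.

```vbscript
Option Explicit

' Sketch of a tolerance-based variant of the aggregate query.
' Rows are assigned to the nearest multiple of 10 and kept only when
' they lie within 0.1 of it.
Dim ss
ss =   "SELECT 10 * Round(t / 10) AS bucket, Avg(a), Avg(b), Avg(c)" _
     & " FROM [47155733.txt]" _
     & " WHERE Abs(t - 10 * Round(t / 10)) < 0.1" _
     & " GROUP BY 10 * Round(t / 10)"
WScript.Echo ss
```

The string would be passed to `db.Execute` in the same way as the other queries in the script.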

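Tested on Windows 10, 32-bit and 64-bit, with a German locale (hence the decimal commas in the output). I prefer to specify the file format in a schema.ini file: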

[47155733.txt]
Format=Delimited(,)
ColNameHeader=True
DecimalSymbol=.
Col1=t Double
Col2=a Double
Col3=b Double
Col4=c Double
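
If the schema.ini should be generated rather than written by hand, a small routine like the following could produce the same content. This is only a sketch: `WriteSchemaIni` is a hypothetical helper, it treats every column as Double, and it overwrites any schema.ini already present in the folder.

```vbscript
Option Explicit

' Hypothetical helper: write a schema.ini describing one comma-delimited
' text file with a header row and Double columns named in colNames.
Sub WriteSchemaIni(folder, fileName, colNames)
    Dim fso, f, i
    Set fso = CreateObject("Scripting.FileSystemObject")
    ' True = overwrite an existing schema.ini in that folder.
    Set f = fso.CreateTextFile(fso.BuildPath(folder, "schema.ini"), True)
    f.WriteLine "[" & fileName & "]"
    f.WriteLine "Format=Delimited(,)"
    f.WriteLine "ColNameHeader=True"
    f.WriteLine "DecimalSymbol=."
    For i = 0 To UBound(colNames)
        f.WriteLine "Col" & (i + 1) & "=" & colNames(i) & " Double"
    Next
    f.Close
End Sub
```

For the file used here it could be called as `WriteSchemaIni dd, "47155733.txt", Array("t", "a", "b", "c")`, where `dd` is the data folder from the script.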

Background:

Connectionstrings, odbc Connectionstrings, Driver download