Question

I have a text file that I want to parse and "clean up". Sample data from the file:

Trade '4379160'\Acquire Day 2015-05-07      Create  acquire_day
Trade '4379160'\Fund    XXXY        Create  acquirer_ptynbr
Trade '4379160'\Assinf          Create  assinf
Trade '4379160'\Authorizer          Create  authorizer_usrnbr
Trade '4379160'\Base Curr Equivalent    0       Create  base_cost_dirty

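What I want is to get the first 2 "fields" after the first backslash, for example Acquire Day 2015-05-07. Note that the second field is sometimes empty (this is fine; I don't need any of the Create strings). What I have done is use a RegEx to first find everything after the backslash and then pick out the 2 required fields. My test code so far: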

Private Sub SanitiseTradeAudit(fileInput)
    Dim objFSO, objFile, regEx, validTxt, validTxt1, arrValidTxt, i

    Set objFSO = CreateObject("Scripting.FileSystemObject")

    Set objFile = objFSO.OpenTextFile(fileInput, 1) 
    validTxt = objFile.ReadAll
    objFile.Close
    Set objFile = Nothing

    Set regEx = New RegExp
    regEx.Pattern = "(.*)\'\\(.*)" 'To Remove all [[ Trade '4379160'\ ]] prefix from audit lines
    regEx.Global = True 
    validTxt = regEx.Replace(validTxt, "$2") 'Text would be ==> Aggregate   0       Create  aggregate

    regEx.Pattern = "[(\t.*)](\t.*)" 'Pick only first 2 data points ==> Aggregate   0
    regEx.Global = True
    validTxt1 = regEx.Replace(validTxt, vbCr)

    arrValidTxt = Split(validTxt1, vbCrLf) 'To Remove the first 2 header lines, split it based on new line
    Set objFile = objFSO.OpenTextFile(fileInput, 2)
    For i = 2 To (Ubound(arrValidTxt) - 1) 'Ignore first 2 header lines
        objFile.WriteLine arrValidTxt(i)
    Next
    objFile.Close
    Set objFile = Nothing

    Set regEx = Nothing
    Set objFSO = Nothing
End sub


Call SanitiseTradeAudit("C:\Users\pankaj.jaju\Desktop\ActualAuditMessage.txt")

My question is: can this regex replacement be done with a single pattern?

Answer 1

If you process the file line by line, a pattern like this should work:

^.*?\\([^\t]*)\t([^\t]*)

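The above matches everything up to the first backslash (non-greedy match), followed by two groups of zero or more non-tab characters (greedy match) separated by a single tab. Because each group allows zero characters, a line with an empty second field still matches and simply yields an empty second submatch.

Sample code: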

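Set objFSO = CreateObject("Scripting.FileSystemObject")

Set re = New RegExp
re.Pattern = "^.*?\\([^\t]*)\t([^\t]*)"

' Read the whole file first, because it is reopened for writing below
txt = objFSO.OpenTextFile(fileInput).ReadAll

' 2 = ForWriting; the default mode (ForReading) would make WriteLine fail
Set objFile = objFSO.OpenTextFile(fileInput, 2)
For Each line In Split(txt, vbNewLine)
  ' Global is False, so Execute returns at most one match per line
  For Each m In re.Execute(line)
    objFile.WriteLine m.SubMatches(0) & vbTab & m.SubMatches(1)
  Next
Next
objFile.Close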
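
To check what the two groups capture, here is a minimal sketch that runs the pattern against two rows modelled on the question's sample data. The exact tab layout of the real file is an assumption: the rows are rebuilt by hand with vbTab, and the second row has an empty second field.

```vbscript
Option Explicit
Dim re, samples, s, m

Set re = New RegExp
re.Pattern = "^.*?\\([^\t]*)\t([^\t]*)"

' Hypothetical rows: the tab positions are assumed, not taken from the real file
samples = Array( _
  "Trade '4379160'\Acquire Day" & vbTab & "2015-05-07" & vbTab & "Create" & vbTab & "acquire_day", _
  "Trade '4379160'\Assinf" & vbTab & vbTab & "Create" & vbTab & "assinf")

For Each s In samples
  For Each m In re.Execute(s)
    ' Brackets make an empty second field visible in the output
    WScript.Echo "[" & m.SubMatches(0) & "] [" & m.SubMatches(1) & "]"
  Next
Next
```

Run with cscript, this should print the first two fields of each row in brackets, with an empty second bracket pair for the Assinf row.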

If you need to handle large files, I would drop ReadAll entirely and read the input file line by line to avoid running out of memory:

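Set objFSO = CreateObject("Scripting.FileSystemObject")

Set re = New RegExp
re.Pattern = "^.*?\\([^\t]*)\t([^\t]*)"

' The output goes to a separate file, since the input is still being read
Set inFile  = objFSO.OpenTextFile(fileInput)
Set outFile = objFSO.OpenTextFile(fileOutput, 2, True)  ' 2 = ForWriting, True = create if missing

Do Until inFile.AtEndOfStream
  line = inFile.ReadLine
  For Each m In re.Execute(line)
    outFile.WriteLine m.SubMatches(0) & vbTab & m.SubMatches(1)
  Next
Loop

inFile.Close
outFile.Close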
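
The question's code also drops the first 2 header lines, which the snippets in this answer do not do explicitly. A possible way to fold that into the line-by-line approach is sketched below. The Sub name, its parameters, and the header count of 2 are assumptions made for illustration, not part of the original code.

```vbscript
' Hypothetical helper: the name, parameters and header count are assumptions
Sub ExtractFirstTwoFields(inputPath, outputPath)
  Dim fso, re, inFile, outFile, line, m, lineNo

  Set fso = CreateObject("Scripting.FileSystemObject")
  Set re = New RegExp
  re.Pattern = "^.*?\\([^\t]*)\t([^\t]*)"

  Set inFile  = fso.OpenTextFile(inputPath, 1)         ' 1 = ForReading
  Set outFile = fso.OpenTextFile(outputPath, 2, True)  ' 2 = ForWriting, create if missing

  lineNo = 0
  Do Until inFile.AtEndOfStream
    line = inFile.ReadLine
    lineNo = lineNo + 1
    If lineNo > 2 Then  ' skip the 2 header lines
      For Each m In re.Execute(line)
        outFile.WriteLine m.SubMatches(0) & vbTab & m.SubMatches(1)
      Next
    End If
  Loop

  inFile.Close
  outFile.Close
End Sub
```

It could be called as ExtractFirstTwoFields "in.txt", "out.txt" (placeholder file names). Header lines that contain no backslash followed by a tab-separated field would not match the pattern anyway, but counting lines makes the skip independent of what the headers contain.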

Merging 2 regex patterns to get a substring
