Question

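I'm using a regular expression to parse a CSV-like file. I'm new to regular expressions, and while it works, it gets slow when there are many fields and one of them contains a very long value. How can I optimize it?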

The CSV I have to parse has the following flavor:

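All fields are strings enclosed in quotes and separated by commas
Quotes inside fields are escaped as two consecutive quotes
Some lines start with unpredictable garbage that needs to be ignored (so far it hasn't contained quotes, thankfully)
Zero-length fields and newlines inside fields are possible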

I'm working in VB.NET, with the following regex:

(^(?!").+?|^(?="))(?<Entry>"(",|(.*?)"(?<!((?!").("")+)),))*(?<LastEntry>"("$|(.*?)"(?<!((?!").("")+))$))

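I handle the newlines by feeding StreamReader.ReadLine into a string variable until the regex succeeds, replacing each newline with a space (which is fine for my purposes). I then extract the field contents using Match.Groups("Entry").Captures and Match.Groups("LastEntry").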

I suspect the performance hit comes from the lookbehind for escaped quotes. Is there a better way?

Thanks for any ideas!

Answer 1

I think your regex is needlessly complicated, and its nested quantifiers cause catastrophic backtracking. Try the following:

^[^"]*(?<Entry>(?>"(?>[^"]+|"")*"),)*(?<LastEntry>(?>"(?>[^"]+|"")*"))$

Explanation:

^                 # Start of string
[^"]*             # Optional non-quotes
(?<Entry>         # Match group 'entry'
 (?>              # Match, and don't allow backtracking (atomic group):
  "               # a quote
  (?>             # followed by this atomic group:
   [^"]+          # one or more non-quote characters
  |               # or
   ""             # two quotes in a row
  )*              # repeat 0 or more times.
  "               # Then match a closing quote
 )                # End of atomic group
 ,                # Match a comma
)*                # End of group 'entry'
(?<LastEntry>     # Match the final group 'lastEntry'
 (?>              # same as before
  "               # quoted field...
  (?>[^"]+|"")*   # containing non-quotes or double-quotes
  "               # and a closing quote
 )                # exactly once.
)                 # End of group 'lastEntry'
$                 # End of string

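This also works on the entire file at once, so you don't have to append one line after another until the regex matches, and you don't have to replace the newlines: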

Dim RegexObj As New Regex("^[^""]*(?<Entry>(?>""(?>[^""]+|"""")*""),)*(?<LastEntry>(?>""(?>[^""]+|"""")*""))$", RegexOptions.Multiline)
Dim MatchResults As Match = RegexObj.Match(SubjectString)
While MatchResults.Success
    ' now you can access MatchResults.Groups("Entry").Captures and
    ' MatchResults.Groups("LastEntry")
    MatchResults = MatchResults.NextMatch()
End While
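
The captures still carry the enclosing quotes, the doubled quotes, and (for Entry) the trailing comma. A minimal sketch of turning them into plain field values might look like the following. The Unquote helper, the module name, and the sample row are assumptions made up for illustration, not part of the original answer, and the pattern uses atomic groups at both levels, as in the explanation above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CsvFieldDemo
    ' Hypothetical helper (not from the answer): converts a raw capture such as
    ' "say ""hi""", into its field value by dropping the trailing comma and the
    ' enclosing quotes, then collapsing doubled quotes into single ones.
    Function Unquote(raw As String) As String
        Dim s As String = raw
        If s.EndsWith(",") Then s = s.Substring(0, s.Length - 1)
        s = s.Substring(1, s.Length - 2)
        Return s.Replace("""""", """")
    End Function

    Sub Main()
        Dim rowRegex As New Regex(
            "^[^""]*(?<Entry>(?>""(?>[^""]+|"""")*""),)*(?<LastEntry>(?>""(?>[^""]+|"""")*""))$",
            RegexOptions.Multiline)

        ' Made-up sample row: leading junk, a plain field, a field with escaped
        ' quotes, and an empty field. Apostrophes stand in for quotes so the
        ' literal stays readable; Replace swaps them for real quotes.
        Dim subject As String = "junk'a','say ''hi''',''".Replace("'"c, """"c)

        Dim m As Match = rowRegex.Match(subject)
        While m.Success
            Dim fields As New List(Of String)
            ' Every repetition of the Entry group is kept as a separate capture.
            For Each c As Capture In m.Groups("Entry").Captures
                fields.Add(Unquote(c.Value))
            Next
            fields.Add(Unquote(m.Groups("LastEntry").Value))

            For Each f As String In fields
                Console.WriteLine("[" & f & "]")
            Next
            m = m.NextMatch()
        End While
    End Sub
End Module
```

For the sample row this should yield three fields: a, say "hi", and an empty string. One caveat when matching a whole file: with RegexOptions.Multiline, $ in .NET matches before \n only, so input with \r\n line endings probably needs to be normalized first (or the pattern adjusted to allow an optional \r before $).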

CSV parsing regex performance
