Question

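I created a program that checks a CSV file against given parameters. However, when I added character limits using regular expressions, its efficiency dropped.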

awk -F, '
BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
NF!=17 { print "incorrect amount of fields-OFFENCE FILE"; next}
#splits the line into 17 separate fields at each comma,
#but ignores commas inside double quotes, so that each field can be checked below.
#then counts the fields and prints a message if there are more or fewer than 17.
!($1~/^("[A-Z0-9]{1,25}")$/) {print "1st field invalid-OFFENCE FILE";}
#checks that field 1 contains only uppercase letters and digits, is between 1 and 25 characters long,
#and begins and ends with a double quote
!($2~/^("[[:digit:]]{1,3}")$/) {print "2nd field invalid-OFFENCE FILE";}
!($3~/^("[A-Z0-9]{1,8}")$/) {print "3rd field invalid-OFFENCE FILE";}
!($4~/^("[A-Z0-9]{0,1}")$/) {print "4th field invalid-OFFENCE FILE";}
!($5~/^("[A-Z0-9]{0,11}")$/) {print "5th field invalid-OFFENCE FILE";}
!($6~/^("")$/) {print "6th field invalid-OFFENCE FILE";}
!($7~/^("[0-9]{4}[-/][0-9]{2}[-/][0-9]{2}")$/) {print "7th field invalid-OFFENCE FILE";}
!($8~/^("[1-5]{1}")$/) {print "8th field invalid-OFFENCE FILE";}
!($9~/^("[0-9]{4}[-/][0-9]{2}[-/][0-9]{2}")$/) {print "9th field invalid-OFFENCE FILE";}
!($10~/^("[0-9]{4}[-/][0-9]{2}[-/][0-9]{2}")$/) {print "10th field invalid-OFFENCE FILE";}
# the validation above checks for dates in the format YYYY-MM-DD with either a - or a / as a separator
!($11~/^("([01]?[0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]")|""$/) {print "11th field invalid-OFFENCE FILE";}
#the regex above tests for times to make sure they meet the format of hh:mm:ss
!($12~/^("([01]?[0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]")|""$/) {print "12th field invalid-OFFENCE FILE";}
!($13~/^("[A-Za-z0-9]{0,70}")|""$/) {print "13th field invalid-OFFENCE FILE";}
!($14~/^("[A-Za-z0-9]{1}")|""$/) {print "14th field invalid-OFFENCE FILE";}
!($15~/^("[0-9]{3}")$/) {print "15th field invalid-OFFENCE FILE";}
!($16~/^(".+{1,2500}")$/) {print "16th field invalid-OFFENCE FILE";}
!($17~/^(".+{1,4000}")|""$/) {print "17th field invalid-OFFENCE FILE";}
{print  "previous field set correct_OFFENCE FILE "}' nppcase_***_******_offence_**************.csv

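So my question is: is there a way to improve the efficiency?

A sample input is not really relevant, because the inefficiency is there whether the fields are populated or empty. Basically I want the code to run faster, and the problem is that the maximum character lengths in the regexes for $16 and $17 are too high.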

Answer 1

!($1~/^("[A-Z0-9]{1,25}")$/) {print "1st field invalid-OFFENCE FILE";}

can be replaced with something like the following (every regex would need to be reviewed), for example:

!($1~/^("[A-Z0-9]+")$/) || (length($1)>27) {print "1st field invalid-OFFENCE FILE";}

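In my tests this is about twice as fast, because length() alone can check the size of a string of compliant characters faster than a bounded repetition in the regex can. Beyond that, the optimization really depends on the specification and on the assumptions we can make about the data source.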
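
As a minimal sketch of that idea (not the asker's full script), the snippet below validates only the first field. The shell function name check_field1 and the shortened messages are assumptions made for illustration; the input is expected to be one quoted value per line.

```shell
# Hypothetical helper: validate field 1 with a cheap length test plus an
# unbounded regex, instead of the {1,25} interval expression.
check_field1() {
  awk '
    # 27 = at most 25 characters plus the two surrounding double quotes.
    # The regex itself still requires at least one character between the quotes.
    !(length($1) <= 27 && $1 ~ /^"[A-Z0-9]+"$/) { print "1st field invalid"; next }
    { print "1st field ok" }
  '
}
```

The length test comes first so that an over-long value is rejected before the regex is evaluated at all.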

Answer 2

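Make this kind of change to all of your conditions and you should see a performance improvement, because the regular expressions are simpler: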

old: !($13~/^("[A-Za-z0-9]{0,70}")|""$/)        {print "13th field invalid-OFFENCE FILE";}
new: !( ($13 ~ /^"[A-Za-z0-9]*"$/) && (length($13) <= 72) )

old: !($14~/^("[A-Za-z0-9]{1}")|""$/)           {print "14th field invalid-OFFENCE FILE";}
new: !($14 ~ /^"[A-Za-z0-9]?"$/)

old: !($15~/^("[0-9]{3}")$/)                    {print "15th field invalid-OFFENCE FILE";}
new: !($15 ~ /^"[0-9]{3}"$/)

old: !($16~/^(".+{1,2500}")$/)                  {print "16th field invalid-OFFENCE FILE";}
new: !( ($16 ~ /^".+"$/) && (length($16) <= 2502) )

old: !($17~/^(".+{1,4000}")|""$/)               {print "17th field invalid-OFFENCE FILE";}
new: !( ($17 ~ /^".*"$/) && (length($17) <= 4002) )
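
To try the pattern on a long field in isolation, here is a small sketch. The names check_long and quoted_ok are hypothetical, and it treats each whole input line as one quoted field rather than splitting a real 17-column record.

```shell
# Hypothetical helper: $1 is the maximum number of characters allowed
# between the quotes (2500 for field 16 in the question).
check_long() {
  awk -v max="$1" '
    # Length test first (limit plus the two quotes), then the simple regex.
    function quoted_ok(s, limit) {
      return length(s) <= limit + 2 && s ~ /^".+"$/
    }
    { print (quoted_ok($0, max) ? "ok" : "invalid") }
  '
}
```

For example, `printf '"some text"\n' | check_long 2500` should report the line as valid, while an empty `""` is rejected because `.+` needs at least one character.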

For portability, you should use character classes such as [[:alnum:]] rather than hard-coding locale-specific ranges such as [A-Za-z0-9].
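
A short sketch of the field 13 check written with a character class, assuming an awk that supports POSIX bracket classes (gawk does; some very old awks may not). The function name check_alnum is hypothetical, and LC_ALL=C is set here only so that [[:alnum:]] is limited to ASCII letters and digits in this demo.

```shell
# Hypothetical helper: each input line is treated as one quoted field.
check_alnum() {
  LC_ALL=C awk '
    # 72 = at most 70 characters plus the two surrounding double quotes.
    { print ((length($0) <= 72 && $0 ~ /^"[[:alnum:]]*"$/) ? "ok" : "invalid") }
  '
}
```

In other locales [[:alnum:]] may also accept non-ASCII letters, so whether that is acceptable depends on the specification of the file.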

How to increase the efficiency of an AWK program
