Question

I want to match two substrings within the lines of a file. For example, I have lines like these:

                DU.DUALGN.D3_D5H0TOD4B_RS1DQ.ELC.L2

                DU.DUALGN.D3_D5H0TOD4B_RS2DQ.ELC.L2 

                EC.DU.DUAB0.D0_OPBQ.ELC.L2

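I have millions of lines like the ones above, and I want to extract only the lines that contain both DUALGN and ELC.L2.

Please help me write a regular expression for this.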

Answer 1

perl -ne 'print if /DUALGN/ and /ELC\.L2/' file

Answer 2

Tcl is more verbose than Perl here:

tclsh << 'END'
set fh [open "filename" r]
while {[gets $fh line] != -1} {
    if {[regexp {DUALGN} $line] && [regexp {ELC\.L2} $line]} {
        puts $line
    }
}
END

Since we are looking for fixed strings, we don't need to invoke the regular expression engine at all. This will probably be faster:

    if { [string first "DUALGN" $line] > -1 && 
         [string first "ELC.L2" $line] > -1
    } {
        puts $line
    }

Using a single regular expression:

This version uses lookaheads, which, if I remember correctly, were introduced in Tcl 8.1.

^(?=.*DUALGN)(?=.*ELC\.L2)

This means: starting at the beginning of the string, look ahead to find "DUALGN" and look ahead to find "ELC.L2". You can use the same regular expression in Perl.

If your version of Tcl can't handle that for some reason, you can use this instead:

(?:DUALGN.*ELC\.L2)|(?:ELC\.L2.*DUALGN)

This means: find "DUALGN" followed by "ELC.L2", or find "ELC.L2" followed by "DUALGN".

Because Donal made me ;) here are some timings:

% set line "DU.DUALGN.D3_D5H0TOD4B_RS1DQ.ELC.L2"
DU.DUALGN.D3_D5H0TOD4B_RS1DQ.ELC.L2
% time {string match *DUALGN* $line; string match *ELC.L2* $line} 1000000
1.122276 microseconds per iteration
% time {string first DUALGN $line; string first ELC.L2 $line} 1000000
1.0179 microseconds per iteration
% time {regexp {^(?=.*DUALGN)(?=.*ELC\.L2)} $line} 1000000
12.840028 microseconds per iteration
% time {regexp {(?:DUALGN.*ELC\.L2)|(?:ELC\.L2.*DUALGN)} $line} 1000000
12.770246 microseconds per iteration
% time {regexp DUALGN $line; regexp ELC\\.L2 $line} 1000000
1.140218 microseconds per iteration

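Conclusion: by choosing to use a single regular expression, you happen to have chosen the slowest implementation. In the timings above, both single-regex versions take roughly ten times as long as any of the two-step checks.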
