I'm new to using RegEx.
I have a list of company phrases (1000+) that I convert into a regex pattern at run time.
Here is how I build the pattern:
ListOfEntries.Sort()
For i As Integer = 0 To (ListOfEntries.Count - 1)
    ListOfRegExEntries.Add("(\b(?i)" & ListOfEntries(i) & "\b)")
Next
RegExPatternString = "(" & String.Join("|", ListOfRegExEntries) & ")"
RegExPattern = New Regex(RegExPatternString)
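The entries are all upper case.
The string being matched is a full-name field. I only want to know whether the string contains a company keyword.
What can I do to optimize the matching process? If anyone needs more information, feel free to ask!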
Answer 0 (score: 1)
Based on some of the other answers/comments, it seems RegEx is not the best choice here. I decided to use this code:
Private Function ContainsOrganizationKeywordTest2() As Boolean
    With Output
        ' Build the upper-cased full name and split it into individual words.
        Dim BuiltFullName As String = UCase(String.Join(Space, {.PrimaryFirstName, .PrimaryMiddleName, .PrimaryLastName}))
        Dim NameParts As List(Of String) = BuiltFullName.Split(Space).ToList
        NameParts.Sort()
        For i As Integer = 0 To (NameParts.Count - 1)
            If (Not String.IsNullOrWhiteSpace(NameParts(i))) Then
                ' ListOfEntries must already be sorted for BinarySearch to work.
                Dim Result As Integer = _OrganizationKeywords.ListOfEntries.BinarySearch(NameParts(i))
                ' BinarySearch returns a negative number when the item is not found.
                If (Result > -1) Then
                    Return True
                End If
            End If
        Next
        Return False
    End With
End Function
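For readers who don't use VB, the same word-lookup idea can be sketched in Python. This is only an illustration under assumptions: the function name and the sample keywords are hypothetical, and a hash set stands in for the sorted list plus BinarySearch used above.

```python
def contains_organization_keyword(full_name, keywords):
    """Return True if any whitespace-separated word of full_name is a keyword."""
    # keywords is assumed to be a set of single-word, upper-case entries.
    tokens = full_name.upper().split()
    return any(token in keywords for token in tokens)


# Hypothetical sample data, not taken from the original keyword list.
ORGANIZATION_KEYWORDS = {"INC", "LLC", "TRUST", "BANK"}

print(contains_organization_keyword("Acme Holdings LLC", ORGANIZATION_KEYWORDS))
print(contains_organization_keyword("John Q Public", ORGANIZATION_KEYWORDS))
```

Note that a word-by-word lookup like this only finds single-word entries. A multi-word phrase, or a word with punctuation attached (for example "LLC."), would not match the way it would with a \b-delimited regex.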
Answer 1 (score: 0)
A few questions
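Answer 2 (score: 0)
I don't know of any regex engine other than Perl's that can print debug output, so, as an analogy, I use Perl sample code to show how you can cut a considerable amount of time when running a regex like this.
I'm sure you can translate this code to VB.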
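It basically just factors out the first letter of each phrase and builds an array
of the phrases that share that letter, joined together. I used a hash to do this,
but it could just as easily be done by sorting all the phrases and then looping
over each group that shares the same first letter.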
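The sort-then-group alternative mentioned above could look roughly like the following Python sketch. The function name `build_factored_pattern` and the sample phrases are made up for illustration, and `re.escape` is an addition that the Perl code below does not perform. Whether the factored form is actually faster depends on the regex engine, so it is worth measuring.

```python
import re
from itertools import groupby


def build_factored_pattern(phrases):
    """Build one alternation in which phrases sharing a first character are grouped."""
    # Sorting puts phrases with the same first character next to each other.
    ordered = sorted(p for p in phrases if p)
    parts = []
    # Grouping is case-sensitive here; "Hello" and "hello" land in separate groups.
    for first_char, group in groupby(ordered, key=lambda p: p[0]):
        # Escape everything so the phrases are matched literally.
        tails = "|".join(re.escape(p[1:]) for p in group)
        parts.append("(?:" + re.escape(first_char) + "(?:" + tails + "))")
    return r"\b(?:" + "|".join(parts) + r")\b"


phrases = ["hello world", "hello LA", "good day", "game on", "total eclipse"]
pattern = re.compile(build_factored_pattern(phrases), re.IGNORECASE)

print(bool(pattern.search("Say HELLO LA to everyone")))
print(bool(pattern.search("goodbye")))
```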
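First, with 1000+ phrases, most letters of the alphabet will probably show up
as the starting character of some phrase, so a normal trie won't help much
in a flat regex.
Then, with a flat regex, every phrase has to be tested until one matches. In the worst case that is 1000+ tests for each character in the source string. That is quite a lot of overhead.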
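When you factor out the first letter of each phrase, you immediately divide that by roughly 26.
Doing so also opens up a secondary trie class for each letter, which reduces
the overhead by a further large factor.
If you do this for 2 characters, the overhead drops to an almost negligible amount.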
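As a rough illustration of factoring more than one leading character, here is a Python sketch that factors a configurable number of levels. The names `factor` and `build_pattern`, the `depth` parameter, and the sample phrases are all hypothetical, and this is not how Perl builds its tries internally. It only shows the shape of the resulting pattern.

```python
import re
from collections import defaultdict


def factor(phrases, depth):
    """Return a regex fragment matching any phrase, factoring `depth` leading characters."""
    if depth == 0 or len(phrases) == 1:
        return "|".join(re.escape(p) for p in sorted(phrases))
    groups = defaultdict(list)
    for phrase in phrases:
        # An empty phrase (a shorter phrase that ended here) is keyed under "".
        groups[phrase[:1]].append(phrase[1:])
    parts = []
    for head in sorted(groups):
        if head == "":
            parts.append("")  # empty alternative: a phrase ends at this point
        else:
            parts.append(re.escape(head) + "(?:" + factor(groups[head], depth - 1) + ")")
    return "|".join(parts)


def build_pattern(phrases, depth=2):
    """Wrap the factored alternation in word boundaries, ignoring case."""
    return re.compile(r"\b(?:" + factor(list(phrases), depth) + r")\b", re.IGNORECASE)


phrases = ["the end", "the trial", "this is cool", "total eclipse", "game on"]
pattern = build_pattern(phrases, depth=2)

print(bool(pattern.search("THE TRIAL begins")))
print(bool(pattern.search("theory")))
```

With `depth=2`, the phrases starting with "t" are first split on "t", then on "h" versus "o", so a candidate such as "total eclipse" is ruled out after two characters instead of being tried in full.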
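Shown below is the debug output for the FLAT one-level (trie) regex
and for the version factored at the single-character level.
To analyze a regex, follow the paths in each TRIEC-EXACTF[..],
which mark the termination
points (pass or fail).
You can see that the number of paths is reduced significantly.
Perl code: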
use strict;
use warnings;
use Data::Dumper;
use re 'debug';
my @Flat_Rx_ary = ();
my @rx_ary = ();
my %LetterHash = ();
while (my $line = <DATA>)
{
    chomp( $line );
    next if ( length($line) == 0);
    push ( @Flat_Rx_ary, $line );
    my $first_char = substr( $line, 0, 1);
    my $remainder = substr( $line, 1 );
    if ( !defined( $LetterHash{ $first_char } )) {
        $LetterHash{ $first_char } = [];
    }
    push ( @{$LetterHash{ $first_char }}, $remainder );
}
print Dumper(\%LetterHash);
# Factored regex ..
my @rx_parts = ();
foreach my $rx_key ( keys %LetterHash )
{
    @{$LetterHash{ $rx_key }} = sort @{$LetterHash{ $rx_key }};
    my $rx_val = join ( '|', @{$LetterHash{ $rx_key }} );
    push ( @rx_parts, '(?:' . $rx_key . '(?:' . $rx_val . '))' );
}
my $total_rx = '(?i)\b(' . join( '|', @rx_parts ) . ')\b';
print $total_rx, "\n\n\n";
my $CompiledRx = qr /$total_rx/;
# Flat regex ..
@Flat_Rx_ary = sort ( @Flat_Rx_ary );
my $Flat_Total_Rx = '(?i)\b(' . join( '|',@Flat_Rx_ary) . ')\b';
print "\n\n\n", $Flat_Total_Rx, "\n\n\n";
my $CompiledFlatRx = qr /$Flat_Total_Rx/;
__DATA__
hello world
this is cool
good day
one day beyond
a very fine time
the end of the season
the trial of the centurn
total eclipse
game on
hello LA
Output:
$VAR1 = {
'a' => [
' very fine time'
],
'h' => [
'ello world',
'ello LA'
],
'g' => [
'ood day',
'ame on'
],
'o' => [
'ne day beyond'
],
't' => [
'his is cool',
'he end of the season',
'he trial of the centurn',
'otal eclipse'
]
};
(?i)\b((?:a(?: very fine time))|(?:h(?:ello LA|ello world))|(?:g(?:ame on|ood day))|(?:o(?:ne day beyond))|(?:t(?:he end of the season|he trial of the centurn|his is cool|otal eclipse)))\b
Compiling REx "(?i)\b((?:a(?: very fine time))|(?:h(?:ello LA|ello world))|"...
Final program:
1: BOUND (2)
2: OPEN1 (4)
4: TRIEC-EXACTF[AGHOTaghot] (74)
<a very fine time> (74)
<h> (15)
15: EXACTF <ello > (18)
18: TRIE-EXACTF[LWlw] (74)
<LA>
<world>
<g> (28)
28: TRIE-EXACTF[AOao] (74)
<ame on>
<ood day>
<one day beyond> (74)
<t> (48)
48: TRIEC-EXACTF[HOho] (74)
<he end of the season>
<he trial of the centurn>
<his is cool>
<otal eclipse>
74: CLOSE1 (76)
76: BOUND (77)
77: END (0)
stclass BOUND minlen 7
(?i)\b(a very fine time|game on|good day|hello LA|hello world|one day beyond|the end of the season|the trial of the centurn|this is cool|total eclipse)\b
Compiling REx "(?i)\b(a very fine time|game on|good day|hello LA|hello worl"...
Final program:
1: BOUND (2)
2: OPEN1 (4)
4: TRIEC-EXACTF[AGHOTaghot] (60)
<a very fine time>
<game on>
<good day>
<hello LA>
<hello world>
<one day beyond>
<the end of the season>
<the trial of the centurn>
<this is cool>
<total eclipse>
60: CLOSE1 (62)
62: BOUND (63)
63: END (0)
stclass BOUND minlen 7
Freeing REx: "(?i)\b((?:a(?: very fine time))|(?:h(?:ello LA|ello world))|"...
Freeing REx: "(?i)\b(a very fine time|game on|good day|hello LA|hello worl"...