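I'm using gskinner's RegExr tool to help come up with a pattern that can find authorization numbers in a field that contains a lot of other garbage. An authorization number is a string containing letters (sometimes), digits (always), and hyphens (sometimes); that is, an authorization always contains a digit somewhere, but not always hyphens or letters. In addition, the authorization number can be located anywhere in the field I'm searching.
Examples of proper authorization numbers include: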
5555834384734 ' All digits
12110-AANM ' Alpha plus digits, plus hyphens
R-455545-AB-9 ' Alpha plus digits, plus multiple hyphens
R-45-54A-AB-9 ' Alpha plus digits, plus multiple hyphens
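W892160 ' Alpha plus digits without hyphens
Here is some sample data with extra garbage, which is sometimes attached to the actual authorization number with a hyphen or with no space at all, making it look like part of the number. The garbage comes in predictable forms/words: REF, CHEST, IP, AMB, OBV, and HOLD, none of which are part of an authorization number.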
5557653700 IP
R025257413-001
REF 120407175
SNK601M71016
U0504124 AMB
W892160
019870270000000
00Q926K2
A025229563
01615217 AMB
12042-0148
SNK601M71016
12096NHP174
12100-ACDE
12110-AANM
12114AD5QIP
REF-34555
3681869/OBV ONL
Here is the pattern I'm using:
"\b[a-zA-Z]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*"
I'm still learning RegExp, so it can no doubt be improved, but it works for the cases above and not for the following ones:
REFA5-208-4990IP 'Extract the string 'A5-208-4990' without REF or IP
OBV1213110379 'Extract the string '1213110379' without the OBV
5520849900AMB 'Extract the string '5520849900' without AMB
5520849900CHEST 'Extract the string '5520849900' without CHEST
5520849900-IP 'Extract the string '5520849900' without -IP
1205310691-OBV 'Extract the string without the -OBV
R-025257413-001 'Numbers of this form should also be allowed.
NO PCT 93660 'If string contains the word NO anywhere, it is not a match
HOLDA5-208-4990 'If string contains the word HOLD anywhere, it is not a match
Can anyone help?
For testing purposes, here is a Sub that creates a table containing sample input data:
Sub CreateTestAuth()
Dim dbs As Database
Set dbs = CurrentDb
With dbs
.Execute "CREATE TABLE tbl_test_auth " _
& "(AUTHSTR CHAR);"
.Execute " INSERT INTO tbl_test_auth " _
& "(AUTHSTR) VALUES " _
& "('5557653700 IP');"
.Execute " INSERT INTO tbl_test_auth " _
& "(AUTHSTR) VALUES " _
& "(' R025257413-001');"
.Execute " INSERT INTO tbl_test_auth " _
& "(AUTHSTR) VALUES " _
& "('REF 120407175');"
.Execute " INSERT INTO tbl_test_auth " _
& "(AUTHSTR) VALUES " _
& "('SNK601M71016');"
.Execute " INSERT INTO tbl_test_auth " _
& "(AUTHSTR) VALUES " _
& "('U0504124 AMB');"
.Execute " INSERT INTO tbl_test_auth " _
& "(AUTHSTR) VALUES " _
& "('3681869/OBV ONL');"
.Execute " INSERT INTO tbl_test_auth " _
& "(AUTHSTR) VALUES " _
& "('REFA5-208-4990IP');"
.Execute " INSERT INTO tbl_test_auth " _
& "(AUTHSTR) VALUES " _
& "('5520849900AMB');"
.Execute " INSERT INTO tbl_test_auth " _
& "(AUTHSTR) VALUES " _
& "('5520849900CHEST');"
.Execute " INSERT INTO tbl_test_auth " _
& "(AUTHSTR) VALUES " _
& "('5520849900-IP');"
.Execute " INSERT INTO tbl_test_auth " _
& "(AUTHSTR) VALUES " _
& "('1205310691-OBV');"
.Execute " INSERT INTO tbl_test_auth " _
& "(AUTHSTR) VALUES " _
& "('HOLDA5-208-4990');"
.Close
End With
End Sub
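Answer 0 (score: 1)
OK, at first I thought the extra requirements would make the regex a lot longer.
But with a positive lookahead it actually stays almost the same size. This time, just the regex: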
\b(?=.*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b
Or broken down with comments (whitespace is ignored):
\b # Word start
(?=.*\d) # A number has to follow somewhere after this point
( # Start capture group
[a-z0-9]+ # At least one alphanum
(?:-[a-z0-9]+)* # Possibly more attached with hyphen
) # End capture group
\b # Word end
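But note that not every regex flavor supports variable-width lookahead. I don't know about the one in VBA.
Second note: the (?=) condition is also satisfied when the digit appears after the end of the word. So in
DONT-RECOGNIZE-ME but 1-5ay-yes
DONT-RECOGNIZE-ME is captured as well, because the digit that satisfies the lookahead comes later in the string.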
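One possible way around the second caveat is to let the lookahead scan only characters that can belong to the same token, so the required digit has to be inside the candidate itself. The sketch below assumes the VBScript RegExp engine that VBA normally uses, which as far as I know accepts (?=...) lookahead. The name DemoTokenLookahead and the sample strings are illustrative only, and the pattern does not strip junk such as REF or IP that is glued to a number.

```vba
' Sketch only: DemoTokenLookahead and the sample strings are illustrative.
Public Sub DemoTokenLookahead()
    Dim re As Object, m As Object
    Dim sample As Variant

    Set re = CreateObject("VBScript.RegExp")   ' late binding, no reference needed
    re.Global = True
    re.IgnoreCase = True
    ' The lookahead may only cross letters, digits and hyphens, so the
    ' required digit has to sit inside the candidate token itself.
    re.Pattern = "\b(?=[a-z0-9-]*\d)[a-z0-9]+(?:-[a-z0-9]+)*\b"

    For Each sample In Array("DONT-RECOGNIZE-ME but 1-5ay-yes", _
                             "R-025257413-001", _
                             "3681869/OBV ONL")
        For Each m In re.Execute(CStr(sample))
            Debug.Print sample & " => " & m.Value
        Next m
    Next sample
End Sub
```

Run from the Immediate window, this should report only 1-5ay-yes for the first sample, the whole of R-025257413-001 for the second, and 3681869 for the third, since OBV and ONL contain no digit.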
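Answer 1 (score: 0)
The \b at the start is a problem. You also need to allow for some spaces and some dashes. Try this: "[a-zA-Z|\s|-]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*". Run it on the authorization number only.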
Answer 2 (score: 0)
Because of the extra filtering, I would use a two-step approach.
var splitter = new Regex(@"[\t\n\r]+", RegexOptions.Multiline);
const string INPUT = @"REFA5-208-4990IP
OBV1213110379
5520849900AMB
5520849900CHEST
5520849900-IP
1205310691-OBV
R-025257413-001
NO PCT 93660
HOLDA5-208-4990";
string[] lines = splitter.Split(INPUT);
var blacklist = new[] { "NO", "HOLD" };
var ignores = new[] { "REF", "IP", "CHEST", "AMB", "OBV" };
var filtered = from line in lines
where blacklist.All(black => line.IndexOf(black) < 0)
select ignores.Aggregate(line, (acc, remove) => acc.Replace(remove, ""));
var authorization = new Regex(@"\b([a-z0-9]+(?:-[a-z0-9]+)*)\b", RegexOptions.IgnoreCase);
foreach (string s in filtered)
{
Console.Write("'{0}' ==> ", s);
var match = authorization.Match(s);
if (match.Success)
{
Console.Write(match.Value);
}
Console.WriteLine();
}
Prints:
'A5-208-4990' ==> A5-208-4990
' 1213110379' ==> 1213110379
' 5520849900' ==> 5520849900
' 5520849900' ==> 5520849900
' 5520849900-' ==> 5520849900
' 1205310691-' ==> 1205310691
' R-025257413-001' ==> R-025257413-001
Answer 3 (score: 0)
Sometimes it is easier to keep things loose than to stick strictly to one way or another. :)
Try this:
1 - Add this function
Public Function RemoveJunk(ByVal inputValue As String, ParamArray junkWords() As Variant) As String
Dim junkWord
For Each junkWord In junkWords
inputValue = Replace(inputValue, junkWord, "", , , vbBinaryCompare)
Next
RemoveJunk = inputValue
End Function
2 - Now your task is simple. See the example below for how to use it:
Sub Sample()
Dim theText As String
theText = " REFA5-208-4990IP blah blah "
theText = RemoveJunk(theText, "-REF", "REF", "-IP", "IP", "-OBV", "OBV") '<-- complete this in a similar way
Debug.Print theText
'' -- now apply the regexp here --
End Sub
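Completing the RemoveJunk call is a little tricky: put the longer entries before the shorter ones. For example, "-OBV" should come before "OBV".
Give it a try and see whether it solves your problem.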
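As a rough sketch of how the "now apply the regexp here" step might be filled in: the function below rejects rows containing NO or HOLD, strips the junk words listed in the question, and returns the first token that contains a digit. ExtractAuth is a hypothetical name, the NO/HOLD check is a plain case-sensitive substring test (an assumption that may be too blunt for real data), and the pattern is one possible choice, not the only one.

```vba
' Sketch only: ExtractAuth is a hypothetical helper, not part of the answer above.
Public Function ExtractAuth(ByVal raw As String) As String
    Dim junkWord As Variant, cleaned As String
    Dim re As Object, matches As Object

    ' Assumption: any row containing NO or HOLD is rejected outright.
    If InStr(1, raw, "NO", vbBinaryCompare) > 0 _
       Or InStr(1, raw, "HOLD", vbBinaryCompare) > 0 Then
        Exit Function                      ' returns an empty string
    End If

    ' Strip the junk words named in the question.
    cleaned = raw
    For Each junkWord In Array("REF", "CHEST", "IP", "AMB", "OBV")
        cleaned = Replace(cleaned, junkWord, "", , , vbBinaryCompare)
    Next junkWord

    Set re = CreateObject("VBScript.RegExp")   ' late binding
    re.IgnoreCase = True
    ' Letters/digits joined by hyphens; the lookahead requires a digit
    ' somewhere inside the same token.
    re.Pattern = "\b(?=[a-z0-9-]*\d)[a-z0-9]+(?:-[a-z0-9]+)*\b"
    Set matches = re.Execute(cleaned)
    If matches.Count > 0 Then ExtractAuth = matches(0).Value
End Function
```

For example, ExtractAuth("REFA5-208-4990IP") should return A5-208-4990 and ExtractAuth("HOLDA5-208-4990") should return an empty string. Because the pattern never ends on a hyphen, a leftover trailing "-" (as in 1205310691-OBV) is dropped without listing "-OBV" separately.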
Answer 4 (score: 0)
Your sample input file (the path of this file should be supplied to function<GetMatches> as inputFilePath):
5557653700 IP
R025257413-001
REF 120407175
SNK601M71016
U0504124 AMB
W892160
019870270000000
00Q926K2
A025229563
01615217 AMB
12042-0148
SNK601M71016
12096NHP174
12100-ACDE
12110-AANM
12114AD5QIP
REF-34555
3681869/OBV ONL
Here is the junk list, saved in a file (the path of this file should be supplied to function<GetMatches> as replaceDBPath):
^REF
IP$
^OBV
AMB$
CHEST$
-OBV$
^.*(NO|HOLD).*$
Here is the bas module:
Option Explicit
'This example uses the following references:
'Microsoft VBScript Regular Expressions 5.5 and Microsoft Scripting Runtime
Private fso As New Scripting.FileSystemObject
Private re As New VBScript_RegExp_55.RegExp
Private Function GetJunkList(fpath$) As String()
0 On Error GoTo errHandler
1 If fso.FileExists(fpath) Then
2 Dim junkList() As String, mts As MatchCollection, mt As Match, pos&, tmp$
3 tmp = fso.OpenTextFile(fpath).ReadAll()
4 With re
5 .Global = True
6 .MultiLine = True
7 .Pattern = "[^\r\n]+"
8 Set mts = .Execute(tmp)
9 ReDim junkList(mts.Count - 1)
10 For Each mt In mts
11 junkList(pos) = mt.Value
12 pos = pos + 1
13 Next mt
14 End With
15 GetJunkList = junkList
16 Else
17 MsgBox "File not found at:" & vbCr & fpath
18 End If
19 Exit Function
errHandler:
Dim Msg$
With Err
Msg = "Error '" & .Number & " " & _
.Description & "' occurred in " & _
"Function<GetJunkList>" & IIf(Erl <> 0, " at line " & CStr(Erl) & ".", ".")
End With
MsgBox Msg, vbCritical
End Function
Public Function GetMatches(replaceDBPath$, inputFilePath$) As String()
0 On Error GoTo errHandler
1 Dim junks() As String, junkPat$, tmp$, results() As String, pos&, mts As MatchCollection, mt As Match
2 junks = GetJunkList(replaceDBPath)
3 tmp = fso.OpenTextFile(inputFilePath).ReadAll
4
5 With re
6 .Global = True
7 .MultiLine = True
8 .IgnoreCase = True
9 For pos = LBound(junks) To UBound(junks)
10 junkPat = junks(pos)
11 .Pattern = junkPat
12 'replace junk with []
13 tmp = .Replace(tmp, "")
14 Next pos
15
16 'trim lines [if all input data in one line]
17 .Pattern = "^[ \t]*|[ \t]*$"
18 tmp = .Replace(tmp, "")
19
20 'create array using provided pattern
21 pos = 0
22 .Pattern = "\b[a-z]*[\d]+\-*\d*[a-z0-9]*\b"
23 Set mts = .Execute(tmp)
24 ReDim results(mts.Count - 1)
25 For Each mt In mts
26 results(pos) = mt.Value
27 pos = pos + 1
28 Next mt
29 End With
30
31 GetMatches = results
32 Exit Function
errHandler:
Dim Msg$
With Err
Msg = "Error '" & .Number & " " & _
.Description & "' occurred in " & _
"Function<GetMatches>" & IIf(Erl <> 0, " at line " & CStr(Erl) & ".", ".")
End With
MsgBox Msg, vbCritical
End Function
And a sample tester:
Public Sub tester()
Dim samples() As String, s
samples = GetMatches("C:\Documents and Settings\Cylian\Desktop\junks.lst", "C:\Documents and Settings\Cylian\Desktop\sample.txt")
For Each s In samples
MsgBox s
Next
End Sub
It can be called from the Immediate window:
tester
Hope this helps.