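I need to match a string against OCR-recognized text and find its position, with an arbitrary tolerance for wrong, missing or extra characters. The result should be the position of the best match and, possibly (not necessarily), the length of the matching substring.
For example: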
String: 9912, 1.What is your name?
Substring: 1. What is your name?
Tolerance: 1
Result: match on character 7
String: Where is our caat if any?
Substring: your cat
Tolerance: 2
Result: match on character 10
String: Tolerance is t0o h1gh.
Substring: Tolerance is too high;
Tolerance: 1
Result: no match
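I tried to adapt the Levenshtein algorithm, but it does not work properly for substrings and does not return the position.
An algorithm in Delphi would be preferred, but any implementation or pseudo-logic would do.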
Answer 0 (score: 8)
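Here is a recursive implementation that works, but it might not be fast enough. The worst case is when no match can be found and all but the last char of "What" match at every index in "Where". In that case the algorithm does about Length(What)-1 + Tolerance comparisons for each char in Where, plus the recursive calls: every unit of Tolerance that gets spent can trigger up to two of them, so the branching grows with Tolerance. Since both Tolerance and the length of What are constants, I'd say the algorithm is O(n) in the length of Where. Its performance degrades linearly with the length of both "What" and "Where".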
function BrouteFindFirst(What, Where:string; Tolerance:Integer; out AtIndex, OfLength:Integer):Boolean;
var i:Integer;
    aLen:Integer;
    WhatLen, WhereLen:Integer;

  function BrouteCompare(wherePos, whatPos, Tolerance:Integer; out Len:Integer):Boolean;
  var aLen:Integer;
      aRecursiveLen:Integer;
  begin
    // Skip perfect match characters
    aLen := 0;
    while (whatPos <= WhatLen) and (wherePos <= WhereLen) and (What[whatPos] = Where[wherePos]) do
    begin
      Inc(aLen);
      Inc(wherePos);
      Inc(whatPos);
    end;
    // Did we find a match?
    if (whatPos > WhatLen) then
    begin
      Result := True;
      Len := aLen;
    end
    else if Tolerance = 0 then
      Result := False // No match and no more "wild cards"
    else
    begin
      // We'll make a recursive call to BrouteCompare, allowing for some tolerance in the string
      // matching algorithm.
      Dec(Tolerance); // use up one "wildcard"
      Inc(whatPos); // consider the current char matched
      // First try: the char of What is missing from Where (wherePos stays put)
      if BrouteCompare(wherePos, whatPos, Tolerance, aRecursiveLen) then
      begin
        Len := aLen + aRecursiveLen;
        Result := True;
      end
      // Second try: the char of Where is a wrong char (skip it as well)
      else if BrouteCompare(wherePos + 1, whatPos, Tolerance, aRecursiveLen) then
      begin
        Len := aLen + aRecursiveLen;
        Result := True;
      end
      else
        Result := False; // no luck!
    end;
  end;

begin
  WhatLen := Length(What);
  WhereLen := Length(Where);
  // Try every start position in Where and return the first one that matches
  for i:=1 to Length(Where) do
  begin
    if BrouteCompare(i, 1, Tolerance, aLen) then
    begin
      AtIndex := i;
      OfLength := aLen;
      Result := True;
      Exit;
    end;
  end;
  // No match found!
  Result := False;
end;
I tested the function with the following code:
procedure TForm18.Button1Click(Sender: TObject);
var AtIndex, OfLength:Integer;
begin
  if BrouteFindFirst(Edit2.Text, Edit1.Text, ComboBox1.ItemIndex, AtIndex, OfLength) then
    Label3.Caption := 'Found @' + IntToStr(AtIndex) + ', of length ' + IntToStr(OfLength)
  else
    Label3.Caption := 'Not found';
end;
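The same search can also be tried without building a form. The following is a rough Python port of the recursive routine, written in Python only because the question accepts any implementation. The name `fuzzy_find_first` and the tuple return value are assumptions made for this sketch and are not part of the Delphi code. Positions are reported 1-based so they line up with the Delphi string indexes.

```python
def fuzzy_find_first(what, where, tolerance):
    """Hypothetical port of BrouteFindFirst.

    Returns (index, length) of the first tolerated match, with a
    1-based index as in Delphi, or None when nothing matches.
    """

    def compare(where_pos, what_pos, tol):
        # Skip characters that match exactly (0-based positions here).
        matched = 0
        while (what_pos < len(what) and where_pos < len(where)
               and what[what_pos] == where[where_pos]):
            matched += 1
            where_pos += 1
            what_pos += 1
        if what_pos >= len(what):
            return matched          # all of `what` has been consumed
        if tol == 0:
            return None             # mismatch and no tolerance left
        # Spend one unit of tolerance on what[what_pos]: first assume the
        # char is missing from `where`, then assume it was misread.
        for next_where in (where_pos, where_pos + 1):
            rest = compare(next_where, what_pos + 1, tol - 1)
            if rest is not None:
                return matched + rest
        return None

    for start in range(len(where)):
        length = compare(start, 0, tolerance)
        if length is not None:
            return start + 1, length
    return None


if __name__ == "__main__":
    print(fuzzy_find_first("1. What is your name?", "9912, 1.What is your name?", 1))
    print(fuzzy_find_first("your cat", "Where is our caat if any?", 2))
    print(fuzzy_find_first("Tolerance is too high;", "Tolerance is t0o h1gh.", 1))
```

As in the Delphi version, the reported length counts only the characters that matched exactly, so it can be shorter than the span actually covered in the text.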
For the case:
String: Where is our caat if any?
Substring: your cat
Tolerance: 2
Result: match on character 10
The function reports a match at character 9, of length 6. For the other two examples it gives the expected results.
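For comparison, the Levenshtein adaptation mentioned in the question can be made to work on substrings by letting a match start at any text position at zero cost, a technique usually credited to Sellers. This is a different approach from the recursive routine above, not a rewrite of it: it also treats an extra character in the text as one edit, a move the recursive version does not try directly. The sketch below is in Python because the question accepts any implementation. The name `best_fuzzy_match`, the tuple layout and the tie-breaking rule (earliest start wins) are assumptions made for illustration.

```python
def best_fuzzy_match(pattern, text, tolerance):
    """Hypothetical sketch of approximate substring search.

    Returns (start, length, distance) for the lowest-distance match,
    with a 1-based start, or None if no match is within tolerance.
    """
    m = len(pattern)
    # Column for the empty text prefix: the first i pattern chars are all
    # missing, costing i edits. Each cell is (distance, 0-based start).
    prev = [(i, 0) for i in range(m + 1)]
    best = None
    for j, ch in enumerate(text, start=1):
        # Row 0 always costs 0: a match may begin after any j text chars.
        cur = [(0, j)]
        for i in range(1, m + 1):
            sub_cost = 0 if pattern[i - 1] == ch else 1
            cur.append(min(
                (prev[i - 1][0] + sub_cost, prev[i - 1][1]),  # same or wrong char
                (prev[i][0] + 1, prev[i][1]),                 # extra char in text
                (cur[i - 1][0] + 1, cur[i - 1][1]),           # char missing from text
            ))
        dist, start = cur[m]
        # Keep the first end position that reaches the lowest distance.
        if dist <= tolerance and (best is None or dist < best[2]):
            best = (start + 1, j - start, dist)
        prev = cur
    return best


if __name__ == "__main__":
    print(best_fuzzy_match("1. What is your name?", "9912, 1.What is your name?", 1))
    print(best_fuzzy_match("your cat", "Where is our caat if any?", 2))
    print(best_fuzzy_match("Tolerance is too high;", "Tolerance is t0o h1gh.", 1))
```

For the first example this returns (7, 20, 1), and for the third it returns None. In the "caat" example several candidates cost exactly 2 edits, for instance "our caat" starting at character 10 (one missing char, one extra char) and " our ca" starting at character 9 (one wrong char, one missing char), so whether position 9 or 10 is reported depends on how ties are broken. The running time is proportional to Length(What) * Length(Where) and does not grow with Tolerance.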
Answer 1 (score: 0)
Here is a complete example of fuzzy matching (approximate search); you can use or change the algorithm as you need: https://github.com/alidehban/FuzzyMatch