I'm trying to find fuzzy keywords (static text) on an OCR'd page with the help of Levenshtein distance.
To do this, I want to allow a certain percentage of errors (say, 15%).
string Keyword = "past due electric service";
Since the keyword is 25 characters long, I want to allow 4 errors (25 * .15 = 3.75, rounded up).
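As a minimal sketch of that arithmetic (the class `ErrorBudget` and the method `AllowedErrors` are hypothetical names used only for illustration), the error budget could be computed with `Math.Ceiling` so that it always rounds up:

```csharp
using System;

public static class ErrorBudget
{
    // Hypothetical helper: number of edits tolerated for a keyword,
    // given an allowance in percent, always rounded up.
    public static int AllowedErrors(string keyword, decimal percentAllowed)
    {
        return (int)Math.Ceiling(keyword.Length * percentAllowed / 100m);
    }
}
```

With the keyword above, `ErrorBudget.AllowedErrors("past due electric service", 15m)` gives 4.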
I need to be able to compare it with...
string Entire_OCR_Page = "previous bill amount payment received on 12/26/13 thank " +
    "you! current electric service total balances unpaid 7 " +
    "days after the total due date are subject to a late " +
    "charge of 7.5% of the amount due or $2.00, whichever/5 " +
    "greater. ";
This is what I'm doing now...
int LevenshteinDistance = LevenshteinAlgorithm(Keyword, Entire_OCR_Page); // = 202
int NumberOfErrorsAllowed = 4;
int Allowance = (Entire_OCR_Page.Length - Keyword.Length) + NumberOfErrorsAllowed; // = 205
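Obviously, Keyword is not found in Entire_OCR_Page (and it shouldn't be). However, using Levenshtein distance, the number of errors (202) is within the allowance of 205, so my logic says the keyword was found.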
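The reason the check is so permissive is that the allowance already contains the full length difference, and the distance between two strings can never be smaller than that difference. Whenever the keyword's characters merely occur in order somewhere in the page, the distance equals the length difference and the test passes. A small sketch of this effect on a toy input follows; the class and method names are hypothetical, and it assumes the page is at least as long as the keyword:

```csharp
using System;

public static class WholePageCheckDemo
{
    // Reproduces the whole-page test described above on arbitrary input.
    public static bool WholePageCheck(string keyword, string page, int errorsAllowed)
    {
        int distance = Levenshtein(keyword, page);
        int allowance = (page.Length - keyword.Length) + errorsAllowed;
        return distance <= allowance;
    }

    // Standard two-row Levenshtein distance.
    public static int Levenshtein(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[t.Length];
    }
}
```

For example, `WholePageCheckDemo.Levenshtein("abc", "xxaxxbxxcxx")` is 8, which is exactly the length difference, so `WholePageCheck("abc", "xxaxxbxxcxx", 1)` reports a match even though "abc" never appears contiguously.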
Does anyone know of a better way to do this?
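Answer 0 (score: 1)
I answered my own question by using substrings. Posting in case anyone else runs into the same kind of problem. It's a little unorthodox, but it works well for me.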
decimal PossibleStringLength = PossibleString.Length; // length of the string to search
decimal StaticTextLength = StaticText.Length; // length of the text to search for
// Number of errors allowed for the given ErrorAllowance percentage
decimal NumberOfErrorsAllowed = Math.Round(StaticTextLength * (ErrorAllowance / 100m), MidpointRounding.AwayFromZero);
int TextLengthBuffer = (int)StaticTextLength - 1; // start with one character fewer than the static text
int LowestLevenshteinNumber = 999999; // initialize to a very high value
// Look for the best match with one character fewer than the static text, then with the exact length,
// and finally with one character more. (This is because one letter can be recognized as
// two (W -> VV) and vice versa.)
for (int i = 0; i < 3; i++)
{
    // Slide a window of TextLengthBuffer characters across the string being searched.
    for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
    {
        string possibleResult = PossibleString.Substring(e - TextLengthBuffer, TextLengthBuffer);
        int lAllowance = (int)Math.Round((possibleResult.Length - StaticTextLength) + NumberOfErrorsAllowed, MidpointRounding.AwayFromZero);
        int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);
        // Keep a strictly better match; on a tie, prefer the window with the exact length.
        if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
        {
            PossibleResult = new StaticTextResult { text = possibleResult, errors = lNumber };
            LowestLevenshteinNumber = lNumber;
        }
    }
    TextLengthBuffer++;
}
public static int LevenshteinAlgorithm(string s, string t) // Levenshtein Algorithm
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];
    if (n == 0)
    {
        return m;
    }
    if (m == 0)
    {
        return n;
    }
    for (int i = 0; i <= n; i++)
    {
        d[i, 0] = i;
    }
    for (int j = 0; j <= m; j++)
    {
        d[0, j] = j;
    }
    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    return d[n, m];
}
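Answer 1 (score: 0)
I think it isn't working because a large part of your string is a match. What I would do is split your keyword into separate words.
Then find all the places in OCR_TEXT where those words match.
Then go through all the matched places and check whether 4 of them are consecutive and match the original phrase.
I'm not sure if my explanation is clear?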
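One possible reading of this idea, sketched below, slides a window of consecutive page words across the text and sums the per-word Levenshtein distances against the keyword's words. The names `WordWindowMatcher` and `FindPhrase` are hypothetical, and applying the error budget to the whole window (rather than per word) is an assumption, not something stated above:

```csharp
using System;

public static class WordWindowMatcher
{
    // Returns the index of the first page word at which the keyword's words
    // match consecutively within maxErrors total edits, or -1 if there is none.
    public static int FindPhrase(string keyword, string page, int maxErrors)
    {
        char[] separators = { ' ', '\t', '\r', '\n' };
        string[] keyWords = keyword.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        string[] pageWords = page.Split(separators, StringSplitOptions.RemoveEmptyEntries);

        for (int start = 0; start + keyWords.Length <= pageWords.Length; start++)
        {
            int errors = 0;
            // Stop comparing this window as soon as the budget is exceeded.
            for (int k = 0; k < keyWords.Length && errors <= maxErrors; k++)
            {
                errors += Levenshtein(keyWords[k], pageWords[start + k]);
            }
            if (errors <= maxErrors)
            {
                return start;
            }
        }
        return -1;
    }

    // Standard two-row Levenshtein distance.
    private static int Levenshtein(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[t.Length];
    }
}
```

For instance, `WordWindowMatcher.FindPhrase("past due electric service", "your past duc electric servlce is overdue", 4)` returns 1, since the window starting at the second word needs only two edits. A limitation of this sketch is that it relies on OCR preserving word boundaries; a dropped or inserted space would shift the windows and hide a match.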