Regex or string comparison with error tolerance

Time: 2014-02-03 20:59:02

Tags: c# regex string

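I am trying to do a string comparison in C# with some error allowance. For example, if my search term is "Welcome" but my comparison string (generated by OCR) is "We1come", and my error allowance is 20%, it should match. That part is not so difficult with something like the Levenshtein algorithm. The difficult part is making it work within a larger block of text, the way a regular expression does. For example, maybe my OCR result is "Hello. My name is Ben. We1come to my StackOverflow question.", and I want to detect that We1come is a good result compared with my search term.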

2 answers:

Answer 0: (score: 1)

It took a while, but this works well. Interesting problem :)

// Inputs assumed to be in scope (ocrText and searchTerm are placeholder names):
//   ocrText        - the OCR output to search
//   searchTerm     - the term to look for, e.g. "Welcome"
//   ErrorAllowance - allowed error as a percentage (int or decimal), e.g. 20
string PossibleString = ocrText.ToLower();
string StaticText = searchTerm.ToLower();
decimal PossibleStringLength = PossibleString.Length;
decimal StaticTextLength = StaticText.Length;
// Divide by 100m so that an integer ErrorAllowance is not truncated to 0.
decimal NumberOfErrorsAllowed = Math.Round(StaticTextLength * (ErrorAllowance / 100m), MidpointRounding.AwayFromZero);
string PossibleResult = string.Empty;

if (PossibleString.Contains(StaticText))
{
    // Exact match somewhere in the text. No need to calculate.
    PossibleResult = StaticText;
}
else
{
    int TextLengthBuffer = (int)StaticTextLength - 1;
    int LowestLevenshteinNumber = 999999;

    for (int i = 0; i < 3; i++) // Check for best results with same amount of characters as expected, as well as +/- 1
    {
        for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
        {
            string possibleResult = (PossibleString.Substring((e - TextLengthBuffer), TextLengthBuffer));
            int lAllowance = (int)(Math.Round((possibleResult.Length - StaticTextLength) + (NumberOfErrorsAllowed), MidpointRounding.AwayFromZero));
            int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);

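            // Accept a window within the allowance if it beats the best so far;
            // on a tie, prefer a window of exactly the expected length.
            if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))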
            {
                PossibleResult = possibleResult;
                LowestLevenshteinNumber = lNumber;
            }
        }
        TextLengthBuffer++;
    }
}


public static int LevenshteinAlgorithm(string s, string t)
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];

    if (n == 0)
    {
        return m;
    }

    if (m == 0)
    {
        return n;
    }

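    // Distance from each prefix of s to the empty string.
    for (int i = 0; i <= n; i++)
    {
        d[i, 0] = i;
    }

    // Distance from the empty string to each prefix of t.
    for (int j = 0; j <= m; j++)
    {
        d[0, j] = j;
    }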

    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    return d[n, m];
}
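
For comparison, a simpler word-by-word variant can be sketched as follows. This is not the sliding-window code above: it assumes the OCR keeps word boundaries intact, so it would miss a term that the OCR split in two or merged with a neighbor. The names FuzzyWordSearch, Distance and FindMatches are made up for illustration, and the punctuation list passed to Trim is an assumption about the input.

```csharp
using System;
using System.Collections.Generic;

public static class FuzzyWordSearch
{
    // Levenshtein distance using two rows instead of a full matrix.
    public static int Distance(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++)
        {
            prev[j] = j;
        }

        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(prev[j] + 1, curr[j - 1] + 1),
                    prev[j - 1] + cost);
            }
            // Swap rows so that prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.Length];
    }

    // Returns every whitespace-separated word whose distance to term
    // is within errorPercent of the term's length (case-insensitive).
    public static List<string> FindMatches(string text, string term, decimal errorPercent)
    {
        int allowed = (int)Math.Round(term.Length * (errorPercent / 100m), MidpointRounding.AwayFromZero);
        string needle = term.ToLowerInvariant();
        var matches = new List<string>();
        char[] separators = { ' ', '\t', '\r', '\n' };

        foreach (string raw in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Strip surrounding punctuation (assumed set) before comparing.
            string word = raw.Trim('.', ',', ';', ':', '!', '?', '"');
            if (word.Length > 0 && Distance(needle, word.ToLowerInvariant()) <= allowed)
            {
                matches.Add(word);
            }
        }
        return matches;
    }
}
```

With the text from the question, FuzzyWordSearch.FindMatches("Hello. My name is Ben. We1come to my StackOverflow question.", "Welcome", 20m) allows one edit (20% of 7 characters, rounded) and returns the single word We1come.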

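Answer 1: (score: 0)

If there is some way to predict how the OCR misreads letters, I would replace the letters in the search term with character classes that cover the likely misreads.

If the search term is Welcome, the regex would be (?i)We[l1]come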
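
A minimal sketch of how such a pattern could be generated from a search term. The confusion table here is hypothetical (l/1, o/0, s/5) and would need to be tuned to the OCR engine actually in use; the name OcrRegexBuilder is also made up for illustration. The alternatives are assumed to contain only letters and digits, so they are placed in a character class without escaping.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class OcrRegexBuilder
{
    // Hypothetical table: each letter maps to the characters OCR might emit for it.
    // Values must contain only letters and digits (no regex metacharacters).
    private static readonly Dictionary<char, string> Confusions = new Dictionary<char, string>
    {
        { 'l', "l1" },
        { 'o', "o0" },
        { 's', "s5" },
    };

    // Builds a case-insensitive regex in which every letter listed in the
    // table is replaced by a character class of its possible misreads.
    public static Regex Build(string term)
    {
        var pattern = new StringBuilder();
        foreach (char c in term)
        {
            string alternatives;
            if (Confusions.TryGetValue(char.ToLowerInvariant(c), out alternatives))
            {
                pattern.Append('[').Append(alternatives).Append(']');
            }
            else
            {
                // Escape anything else so that it is matched literally.
                pattern.Append(Regex.Escape(c.ToString()));
            }
        }
        return new Regex(pattern.ToString(), RegexOptions.IgnoreCase);
    }
}
```

For Welcome this yields the pattern We[l1]c[o0]me, which finds We1come in the example text from the question. Unlike the Levenshtein approach, it only catches misreads that are listed in the table.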