String matching

Time: 2016-04-10 06:14:16

Tags: python regex python-3.x fuzzy-search fuzzy-logic

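Sorry in advance for the long post.

EDIT -

Modified from Norman's solution so that it prints and returns if an exact match is found, and otherwise prints all approximate matches. The specific example of searching for etnse in the dictionary file provided at the third pastebin link still finds only 83 of the 85 matches.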

import re
import time

def doMatching(file, origPattern):
    entireFile = file.read()
    patterns = []
    startIndices = []

    begin = time.time()

    # get all of the patterns associated with the given phrase
    for pattern in generateFuzzyPatterns(origPattern):
        patterns.append(pattern)
        for m in re.finditer(pattern, entireFile):
            startIndices.append((m.start(), m.end(), m.group()))
        # if the first pattern (the exact match) hit, just print the result and we're done
        if len(startIndices) != 0 and startIndices[0][2] == origPattern:
            print("\nThere is an exact match at: [{}:{}] for {}".format(*startIndices[0]))
            return

    print('Used {} patterns:'.format(len(patterns)))
    for i, p in enumerate(patterns, 1):
        print('- [{}]  {}'.format(i, p))

    # list for all non-overlapping starting indices
    nonOverlapping = []
    # hold the last match's ending position
    lastEnd = 0
    # find non-overlapping matches by comparing each match's starting index to the previous match's ending index
    # if the starting index >= the previous match's ending index, they aren't overlapping
    for start in sorted(startIndices):
        print(start)
        if start[0] >= lastEnd:
            # start[1] is the ending index of the current match's tuple
            lastEnd = start[1]
            nonOverlapping.append(start)

    print()
    print('Found {} matches:'.format(len(startIndices)))
    # each entry is a tuple (<starting index>, <ending index>, <string at those indices>)
    for start in sorted(startIndices):
        # *start unpacks the tuple so format receives its 3 items as separate arguments
        # for explanation, see: http://stackoverflow.com/questions/2921847/what-does-the-star-operator-mean-in-python
        print('- [{}:{}]  {}'.format(*start))

    print()
    print('Found {} non-overlapping matches:'.format(len(nonOverlapping)))
    for ov in nonOverlapping:
        print('- [{}:{}]  {}'.format(*ov))

    end = time.time()
    print(end-begin)

def generateFuzzyPatterns(origPattern):
    # Escape individual symbols.
    origPattern = [re.escape(c) for c in origPattern]

    # Find exact matches.
    pattern = ''.join(origPattern)
    yield pattern

    # Find matches with changes. (replace)
    for i in range(len(origPattern)):
        t = origPattern[:]
        # replace with a wildcard for each index
        t[i] = '.'
        pattern = ''.join(t)
        yield pattern

    # Find matches with deletions. (omitted)
    for i in range(len(origPattern)):
        t = origPattern[:]
        # remove a char for each index
        t[i] = ''
        pattern = ''.join(t)
        yield pattern

    # Find matches with insertions.
    for i in range(len(origPattern) + 1):
        t = origPattern[:]
        # insert a wildcard between adjacent chars for each index
        t.insert(i, '.')
        pattern = ''.join(t)
        yield pattern

    # Find two adjacent characters being swapped.
    for i in range(len(origPattern) - 1):
        t = origPattern[:]
        if t[i] != t[i + 1]:
            t[i], t[i + 1] = t[i + 1], t[i]
            pattern = ''.join(t)
            yield pattern

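ORIGINAL: http://pastebin.com/bAXeYZcD - the actual function

http://pastebin.com/YSfD00Ju - data to use; there should be 8 matches for 'ware', but I only get 6

http://pastebin.com/S9u50ig0 - data to use; there should be 85 matches for 'etnse', but I only get 77

I left all of the original code in the function because I'm not sure exactly what is causing the problem.

You can search for 'Board:isFull()' in anything to reproduce the error described below.

Examples:

Assuming you saved the second pastebin as 'someFile.txt' in a folder named files in the same directory as the .py file: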

file = open('./files/someFile.txt', 'r')
doMatching(file, "ware")

OR

file = open('./files/someFile.txt', 'r')
doMatching(file, "Board:isFull()")

OR

Assuming you saved the third pastebin as 'dictionary.txt' in a folder named files in the same directory as the .py file:

file = open('./files/dictionary.txt', 'r')
doMatching(file, "etnse")

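- EDIT

The function's parameters work as follows:

file is the file to search (the examples above pass an open file object).

origPattern is a phrase.

The function is basically supposed to be a fuzzy search. It should take the pattern and search the file for matches that are either exact or off by 1 character, i.e. 1 missing character, 1 extra character, 1 replaced character, or 1 character swapped with an adjacent character.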
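To make the "off by 1 character" rule concrete, here is a minimal sketch that checks a single candidate string against the pattern without regular expressions. It is not the approach used in the code above: the helper name within_one_edit is hypothetical, and it assumes you already have candidates to compare (for example, one word per line of the data file).

```python
def within_one_edit(pattern, candidate):
    """Return True if candidate equals pattern or differs by exactly one edit:
    one replaced, one missing, one extra, or two adjacent swapped characters."""
    if pattern == candidate:
        return True
    lp, lc = len(pattern), len(candidate)
    if abs(lp - lc) > 1:
        return False
    # Skip the common prefix; i ends at the first differing position.
    i = 0
    while i < min(lp, lc) and pattern[i] == candidate[i]:
        i += 1
    if lp == lc:
        # One replaced character: everything after position i must agree.
        if pattern[i + 1:] == candidate[i + 1:]:
            return True
        # Two adjacent characters swapped at positions i and i + 1.
        return (i + 1 < lp
                and pattern[i] == candidate[i + 1]
                and pattern[i + 1] == candidate[i]
                and pattern[i + 2:] == candidate[i + 2:])
    if lp > lc:
        # One character missing from the candidate.
        return pattern[i + 1:] == candidate[i:]
    # One extra character in the candidate.
    return pattern[i:] == candidate[i + 1:]


words = ["ware", "wore", "wre", "wares", "awre", "wood"]
close = [w for w in words if within_one_edit("ware", w)]
```

A check like this can serve as an independent way to count the expected matches when the regex-based search reports fewer than expected.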

For the most part it works, but I'm running into a few problems.

First, when I try to use something like 'Board:isFull()' for origPattern, I get the following:

    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis

The above comes from the re library.

I've tried using re.escape(), but it didn't change anything.
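One way this error can arise, shown here as a small sketch rather than a diagnosis of the original pastebin code: if a fuzzy variant is built by deleting a character from the unescaped phrase, a parenthesis can be left without its partner. Escaping each symbol before building variants, as generateFuzzyPatterns above does, keeps every variant compilable. The variable names below are illustrative only.

```python
import re

phrase = "Board:isFull()"

# Deleting the "(" from the raw phrase leaves a stray ")",
# which the regex compiler rejects.
raw_variant = phrase[:12] + phrase[13:]      # "Board:isFull)"
try:
    re.compile(raw_variant)
    raw_ok = True
except re.error:
    raw_ok = False

# Escape symbol by symbol first, then delete a whole token, so a
# backslash is never separated from the character it escapes.
tokens = [re.escape(c) for c in phrase]
escaped_variant = "".join(tokens[:12] + tokens[13:])
match = re.search(escaped_variant, "if Board:isFull) then")
escaped_ok = match is not None
```

Note that calling re.escape() on the whole phrase and then editing the result character by character can still break, because a deletion may remove only the backslash and leave the special character unescaped.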

Second, when I try other things, such as 'Fun()', it reports a match at an index that doesn't contain anything like it; that spot is just a line of '*'.

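Third, when it does find matches, it doesn't always find all of them. For example, I have one file that should produce 85 matches but only gets about 77, and another that should produce 8 but only gets 6. However, the entries are just in alphabetical order, so it's probably a problem with how I do the search or something.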
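One possible contributor to missing matches, offered as an assumption rather than a confirmed cause: re.finditer only returns non-overlapping matches for a given pattern, so an occurrence that starts inside an earlier match is skipped. The sketch below shows the difference on a toy string, using a zero-width lookahead with a capturing group to report every starting position.

```python
import re

text = "aaaa"
pattern = "aa"

# finditer scans left to right and resumes after the end of each match,
# so occurrences that overlap an earlier match are skipped.
plain = [(m.start(), m.end(), m.group())
         for m in re.finditer(pattern, text)]

# The lookahead itself consumes nothing, so the scan advances one
# position at a time; group 1 holds the text that would have matched.
overlapping = [(m.start(1), m.end(1), m.group(1))
               for m in re.finditer("(?=({}))".format(pattern), text)]
```

Here plain holds two matches and overlapping holds three. Whether this explains the 77-of-85 result depends on the data file, so it is worth comparing both counts on the real input.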

Any help is appreciated.

Also, I can't use fuzzyfinder.

0 answers:

No answers yet