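Fuzzy string matchers such as fzf or CtrlP filter a list of strings, keeping those that contain a given search string as a subsequence. For example, suppose a user wants to find a particular photo in a list of files. To find the file
/home/user/photos/2016/pyongyang_photo1.png
it suffices to type ph2016png, because this search string is a subsequence of the file name. (Note that this is not LCS: the whole search string must be a subsequence of the file name.)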
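As a minimal sketch of that subsequence test (the helper name is_subsequence is hypothetical and is not something fzf or CtrlP exposes), a single left-to-right pass over the file name is enough:

```python
def is_subsequence(query, text):
    """Return True if every character of query occurs in text, in order."""
    remaining = iter(text)
    # `ch in remaining` advances the iterator until it finds ch,
    # so each character of text is examined at most once.
    return all(ch in remaining for ch in query)


path = "/home/user/photos/2016/pyongyang_photo1.png"
print(is_subsequence("ph2016png", path))  # True
print(is_subsequence("png2016", path))    # False: no "2016" after "png"
```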
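Checking whether a given search string is a subsequence of another string is trivial, but I wonder how to efficiently obtain the best match. In the example above there are several possible matches. One is
/home/user/photos/2016/pyongyang_photo1.png
but the one the user probably had in mind is
/home/user/photos/2016/pyongyang_photo1.png
To formalize this, I define the "best" match as the one composed of the fewest substrings. That number is 5 for the first example match and 3 for the second (ph, 2016, png).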
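To make the definition concrete, the number of substrings in a match can be computed from the positions it uses in the file name. This is a small sketch that assumes a match is given as an increasing list of indices; the name count_segments and that representation are illustrative only.

```python
def count_segments(indices):
    """Count maximal runs of consecutive indices in an increasing list."""
    segments = 0
    previous = None
    for i in indices:
        # A new segment starts whenever the index does not directly
        # follow the previously used one.
        if previous is None or i != previous + 1:
            segments += 1
        previous = i
    return segments


path = "/home/user/photos/2016/pyongyang_photo1.png"
# Positions of "ph" (in "photos"), "2016" and the final "png".
match = [11, 12, 18, 19, 20, 21, 40, 41, 42]
assert "".join(path[i] for i in match) == "ph2016png"
print(count_segments(match))  # 3
```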
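I came up with this because it would be interesting to obtain the best match in order to assign a score to each result for sorting. I am not interested in approximate solutions, though; my interest in this problem is primarily academic.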
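Given strings s and t, find, among the subsequences of t that are equal to s, one that maximizes the number of pairs of elements that are consecutive in t.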
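For the sake of discussion, call the search query s and the string to test t. The solution to the problem is denoted fuzzy(s, t). I will use Python's string slicing notation. The simplest approach is as follows: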
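Since any solution must use all characters of s in order, an algorithm for this problem can start by searching for the first occurrence of s[0] in t (at index i) and then use the better of the two solutions
t[:i+1] + fuzzy(s[1:], t[i+1:])    # Use the character
t[:i]   + fuzzy(s,     t[i+1:])    # Skip it and use the next occurrence
                                   # of s[0] in t instead
This is obviously not the best way to solve the problem; on the contrary, it is the obvious brute-force one. (In an earlier version of this question I played around with simultaneously searching for a last occurrence and using that information, but that approach turned out not to work.)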
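For reference, the recursion above can be sketched as runnable code. This version is an assumption-laden illustration: it returns only the minimum number of segments rather than a reconstructed string, the name fuzzy_bruteforce is hypothetical, and it makes no attempt to avoid the exponential worst case.

```python
def fuzzy_bruteforce(s, t):
    """Return the fewest segments needed to match s in t, or None if
    s is not a subsequence of t. Exponential time in the worst case."""

    def solve(si, start, prev):
        # si:    index of the next character of s to place
        # start: first index of t that may still be used
        # prev:  index of t used for s[si - 1], or None at the beginning
        if si == len(s):
            return 0
        i = t.find(s[si], start)
        if i == -1:
            return None
        # Option 1: use the occurrence at i. It opens a new segment
        # unless it directly follows the previously used index.
        use = solve(si + 1, i + 1, i)
        if use is not None and (prev is None or i != prev + 1):
            use += 1
        # Option 2: skip it and try the next occurrence of s[si].
        skip = solve(si, i + 1, prev)
        candidates = [c for c in (use, skip) if c is not None]
        return min(candidates) if candidates else None

    return solve(0, 0, None)


path = "/home/user/photos/2016/pyongyang_photo1.png"
print(fuzzy_bruteforce("ph2016png", path))  # 3
```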
→ My question is: what is the most efficient solution to this problem?
Answer 0 (score: 1)
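This may not be the most efficient solution, but it is an efficient one that is easy to implement. To illustrate, I will borrow your example: let /home/user/photos/2016/pyongyang_photo1.png be the file name and ph2016png the input.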
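The first step (precomputation) is optional, but it may speed up the next step (setup), especially if you apply the algorithm to many file names.
Precomputation
Create a table that counts the occurrences of each character in the input. Since you are probably dealing only with ASCII characters, 256 entries are enough (perhaps 128, or even fewer, depending on the character set).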
"ph2016png"
['p'] : 2
['h'] : 1
['2'] : 1
['0'] : 1
['b'] : 0
...
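A possible sketch of this table uses Python's collections.Counter in place of a fixed 256-entry array. The rejection helper has_enough_characters is a hypothetical name, and it reads "the right number of times" as "at least as often as in the input", which is an assumption.

```python
from collections import Counter

query = "ph2016png"
counts = Counter(query)  # characters absent from the query count as 0

print(counts["p"])  # 2
print(counts["h"])  # 1
print(counts["b"])  # 0


def has_enough_characters(filename, query_counts):
    """Cheap rejection test: every query character must occur in the
    file name at least as often as it occurs in the query."""
    file_counts = Counter(filename)
    return all(file_counts[ch] >= n for ch, n in query_counts.items())


path = "/home/user/photos/2016/pyongyang_photo1.png"
print(has_enough_characters(path, counts))         # True
print(has_enough_characters("photo.png", counts))  # False: no digits
```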
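Setup
Slice the file name into substrings by discarding the characters that do not occur in the input. At the same time, check that each character of the input occurs in the file name the required number of times (if the precomputation was done). Finally, check that the characters of the input appear in order in the list of substrings: treating the list of substrings as a single string, for any given character of that string, every character that precedes it in the input must already have been found in the string. This can be done while the substrings are being created.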
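The slicing part of this step might look like the sketch below. It covers only the splitting; the ordering check that later discards the leading "h" is left out, and the function name slice_runs is an assumption.

```python
def slice_runs(filename, query):
    """Split filename into maximal runs of characters that occur in query."""
    allowed = set(query)
    runs, current = [], []
    for ch in filename:
        if ch in allowed:
            current.append(ch)
        elif current:
            # A character outside the query ends the current run.
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


path = "/home/user/photos/2016/pyongyang_photo1.png"
print(slice_runs(path, "ph2016png"))
# ['h', 'ph', '2016', 'p', 'ng', 'ng', 'ph', '1', 'png']
```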
"/home/user/photos/2016/pyongyang_photo1.png"
"h", "ph", "2016", "p", "ng", "ng", "ph", "1", "png"
'p' must come before "h", so throw this one away
"ph", "2016", "p", "ng", "ng", "ph", "1", "png"
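Core
Match the longest substring against the input and keep track of the longest match. This match may keep the beginning of the substring (for example, matching ababa (the substring) against babaa (the input) then yields aba rather than baba), because that is easier to implement, although it is not required. If you do not get a complete match, use the longest match to slice the substring again, then retry with the next longest substring.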
Since there is no instance of incomplete match with your example,
let's take something else, made to illustrate the point.
Let's take "babaaababcb" as the filename, and "ababb" as input.
Substrings : "abaaabab", "b"
Longest substring : "abaaabab"

If you keep the beginning of matches
    Longest match : "aba"
    Slice "abaaabab" into "aba", "aabab"
        -> "aba", "aabab", "b"
    Retry with "aabab"
        -> "aba", "a", "abab", "b"
    Retry with "abab" (complete match)

Otherwise (harder to implement, not necessarily better performing, as shown in this example)
    Longest match : "abab"
    Slice "abaaabab" into "abaa", "abab"
        -> "abaa", "abab", "b"
    Retry with "abaa"
        -> "aba", "a", "abab", "b"
    Retry with "abab" (complete match)
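The "longest match" in the second variant (the one that does not insist on keeping the beginning of the substring) can be computed with a standard longest-common-substring dynamic programme. This is only a sketch of that one building block, not of the whole slicing procedure, and the function name is an assumption.

```python
def longest_common_substring(a, b):
    """Return the longest string that occurs contiguously in both a and b
    (the first such occurrence in a if there are ties)."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the longest common suffix of
    # a[:i-1] and b[:j] from the previous row.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


print(longest_common_substring("abaaabab", "ababb"))  # abab
print(longest_common_substring("ababa", "babaa"))     # baba
```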
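If you do get a complete match, go on to split the input in two, and the list of substrings likewise, then repeat the process of matching the longest substring on each part.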
With "ph2016png" as input
Longest substring : "2016"
Complete match
Match substrings "h", "ph" with input "ph"
Match substrings "p", "ng", "ng", "ph", "1", "png" with input "png"
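You are guaranteed to find the sequence that contains the fewest substrings, because you try the longest substrings first. This usually performs well if the input does not contain many of the short substrings found in the file name.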
Answer 1 (score: 1)
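I would suggest creating a search tree in which each node represents a character position in the haystack that matches one of the needle characters.
The top nodes are siblings and represent the occurrences of the first needle character in the haystack.
The children of a parent node are the nodes that represent occurrences of the next needle character in the haystack, but only those positioned after the position the parent represents.
This logically means that some children are shared by several parents, so the structure is not really a tree but a directed acyclic graph. Some siblings may even have exactly the same children. Other parents may have no children at all: they are dead ends, unless they are at the bottom of the graph, where the leaves represent positions of the last needle character.
Once this graph is set up, a depth-first search in it can easily derive the number of segments still needed from a given node onward, and then minimize over the alternatives.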
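I have added some comments in the Python code below. The code can probably still be improved, but it already seems quite efficient compared with your solution.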
def fuzzy_trincot(haystack, needle, returnSegments = False):
    inf = float('inf')

    def getSolutionAt(node, depth, optimalCount = 2):
        if not depth: # reached end of needle
            node['count'] = 0
            return
        minCount = inf # infinity ensures also that incomplete branches are pruned
        child = node['child']
        i = node['i']+1
        # Optimisation: optimalCount gives the theoretical minimum number of
        # segments needed for any solution. If we find such case,
        # there is no need to continue the search.
        while child and minCount > optimalCount:
            # If this node was already evaluated, don't lose time recursing again.
            # It works without this condition, but that is less optimal.
            if 'count' not in child:
                getSolutionAt(child, depth-1, 1)
            count = child['count'] + (i < child['i'])
            if count < minCount:
                minCount = count
            child = child['sibling']
        # Store the results we found in this node, so if ever we come here again,
        # we don't need to recurse the same sub-tree again.
        node['count'] = minCount

    # Preprocessing: build tree
    # A node represents a needle character occurrence in the haystack.
    # A node can have these keys:
    # i: index in haystack where needle character occurs
    # child: node that represents a match, at the right of this index,
    #        for the next needle character
    # sibling: node that represents the next match for this needle character
    # count: the least number of additional segments needed for matching the
    #        remaining needle characters (only; so not counting the segments
    #        already taken at the left)
    root = { 'i': -2, 'child': None, 'sibling': None }
    # Take a short-cut for when needle is a substring of haystack
    if haystack.find(needle) != -1:
        root['count'] = 1
    else:
        parent = root
        leftMostIndex = 0
        rightMostIndex = len(haystack)-len(needle)
        for j, c in enumerate(needle):
            sibling = None
            child = None
            # Use of leftMostIndex is an optimisation; it works without this argument
            i = haystack.find(c, leftMostIndex)
            # Use of rightMostIndex is an optimisation; it works without this test
            while 0 <= i <= rightMostIndex:
                node = { 'i': i, 'child': None, 'sibling': None }
                while parent and parent['i'] < i:
                    parent['child'] = node
                    parent = parent['sibling']
                if sibling: # not first child
                    sibling['sibling'] = node
                else: # first child
                    child = node
                    leftMostIndex = i+1
                sibling = node
                i = haystack.find(c, i+1)
            if not child: return False
            parent = child
            rightMostIndex += 1
        getSolutionAt(root, len(needle))

    count = root['count']
    if not returnSegments:
        return count

    # Use the `returnSegments` option when you need the character content
    # of the segments instead of only the count. It runs in linear time.
    if count == 1: # Deal with short-cut case
        return [needle]
    segments = []
    node = root['child']
    i = -2
    start = 0
    for end, c in enumerate(needle):
        i += 1
        # Find best child among siblings
        while (node['count'] > count - (i < node['i'])):
            node = node['sibling']
        if count > node['count']:
            count = node['count']
            if end:
                segments.append(needle[start:end])
                start = end
        i = node['i']
        node = node['child']
    segments.append(needle[start:])
    return segments
The function can be called with an optional third argument:
haystack = "/home/user/photos/2016/pyongyang_photo1.png"
needle = "ph2016png"
print (fuzzy_trincot(haystack, needle))
print (fuzzy_trincot(haystack, needle, True))
Output:
3
['ph', '2016', 'png']
Since the function is optimized to return only the count, the second call adds a little to the execution time.
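To cross-check the counts, a plain dynamic programme can be used. This sketch is not the graph method described in this answer: it is a separate table-based approach running in O(len(haystack) · len(needle)) time, and the name min_segments is an assumption. It returns None where the function above returns False.

```python
def min_segments(haystack, needle):
    """Fewest contiguous segments needed to match needle as a subsequence
    of haystack, or None if there is no match."""
    INF = float("inf")
    n = len(haystack)
    if not needle:
        return 0
    # prev[i]: fewest segments for the needle prefix handled so far,
    # with its last character placed exactly at haystack[i].
    prev = [1 if ch == needle[0] else INF for ch in haystack]
    for c in needle[1:]:
        cur = [INF] * n
        best = INF  # running minimum of prev[0 .. i-1]
        for i in range(1, n):
            best = min(best, prev[i - 1])
            if haystack[i] == c:
                # Either extend the segment ending at i-1,
                # or open a new segment after any earlier placement.
                cur[i] = min(prev[i - 1], best + 1)
        prev = cur
    result = min(prev, default=INF)
    return None if result == INF else result


haystack = "/home/user/photos/2016/pyongyang_photo1.png"
print(min_segments(haystack, "ph2016png"))  # 3
```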