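I have several folders in a directory. In the middle of a long list of folder paths, a semi-structured pattern typically appears that shares a common parent folder. Only a fixed set of folder names is available, so only the arrangement and length of the paths are unique. Here is an example list: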
/project/a/b/static <--- not part of any chunk due to missing '(integer)' in /b/
/project/a/b/a/static <--- not part of any chunk
/project/a/b(1)/static
/project/a/b(1)/linked
/project/a/b(1)/flat
/project/c/c <--- not part of any chunk
/project/a/b(2)/static
/project/a/b(2)/linked
/project/a/b(2)/flat
/project/a/b(3)/static
/project/a/b(3)/linked
/project/a/b(3)/unique <--- part of this chunk due to same parent folder names
/project/a/b(3)/flat
/project/a/b(4)/static
/project/a/b(4)/linked
/project/a/b(4)/flat
/project/a/a/a/a/a/linked <---- not part of any chunk
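Basically, what I want to do is find the "chunks" of similar folder paths that are delimited in the way shown above. The end result would be something like a list of "chunks", with the outliers removed.
I only have pseudocode-level ideas so far, but I definitely want to cluster similar paths based on string length and/or some kind of Levenshtein distance.
It seems that I need approximate string matching rather than clustering?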
Answer 0 (score: 4)
Assuming these file paths are in a list called paths, you can use itertools.groupby:

import itertools as it
import re

REGEX = re.compile(r'/b\((\d+)\)/')  # change to match your needs

def path_grouper(path):
    match = REGEX.search(path)
    if match is None:
        # no numbered folder: the path is its own key, so it stays alone
        return (False, path)
    return (True, int(match.group(1)))

Used as:

In [6]: for (is_group, key), group in it.groupby(paths, path_grouper):
   ...:     if is_group:
   ...:         print('Got a group with key: {}\n'.format(key))
   ...:         for path in group:
   ...:             print(path)
   ...:         print('End group')
   ...:     else:
   ...:         print('Got lonely path:\n')
   ...:         for path in group:
   ...:             print(path)
   ...:
Got lonely path:
/project/a/b/static
Got lonely path:
/project/a/b/a/static
Got a group with key: 1
/project/a/b(1)/static
/project/a/b(1)/linked
/project/a/b(1)/flat
End group
Got lonely path:
/project/c/c
Got a group with key: 2
/project/a/b(2)/static
/project/a/b(2)/linked
/project/a/b(2)/flat
End group
Got a group with key: 3
/project/a/b(3)/static
/project/a/b(3)/linked
/project/a/b(3)/unique
/project/a/b(3)/flat
End group
Got a group with key: 4
/project/a/b(4)/static
/project/a/b(4)/linked
/project/a/b(4)/flat
End group
Got lonely path:
/project/a/a/a/a/a/linked

If you want a more flexible approach, you can try the difflib standard module. In particular, you can use SequenceMatcher's find_longest_match() or get_matching_blocks() to see where two paths match, and then try to decide whether they should be grouped.

Example usage:
import difflib

def make_path_grouper():
    matcher = difflib.SequenceMatcher()

    def path_grouper(path):
        if not matcher.a:
            matcher.set_seq1(path)
            return path
        else:
            matcher.set_seq2(path)
            matchings = matcher.get_matching_blocks()
            # arbitrary code to decide whether we have a match
            if any(size > 14 for _, _, size in matchings):
                # we have a match
                return matcher.a
            else:
                # no match. The new path supersedes old "a"
                matcher.set_seq1(path)
                return path

    return path_grouper
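
To make the threshold in the heuristic above less opaque, here is a minimal sketch of what SequenceMatcher reports for sample paths taken from the question. The variable names (in_chunk, same_chunk and so on) are illustrative only and are not part of the answer's code.

```python
import difflib

in_chunk = "/project/a/b(1)/static"
sibling = "/project/a/b(1)/linked"      # same parent folder b(1)
next_chunk = "/project/a/b(2)/static"   # different parent folder b(2)

matcher = difflib.SequenceMatcher(None, in_chunk, sibling)

# Every common block as Match(a=start_in_a, b=start_in_b, size=length).
# The last entry is always a zero-length sentinel.
blocks = matcher.get_matching_blocks()
for block in blocks:
    print(block, repr(in_chunk[block.a:block.a + block.size]))

# Longest common block between two paths of the same chunk.
same_chunk = matcher.find_longest_match(0, len(in_chunk), 0, len(sibling))
print(same_chunk.size)   # 16, the shared "/project/a/b(1)/" prefix

# Longest common block between paths of two different chunks.
matcher.set_seq2(next_chunk)
other_chunk = matcher.find_longest_match(0, len(in_chunk), 0, len(next_chunk))
print(other_chunk.size)  # 13, only "/project/a/b(" is shared
```

A threshold that accepts 16 but rejects 13 separates the chunks in this particular list, which is why size > 14 happens to work here. A different naming scheme would need a different rule.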
Note that the logic for deciding when a group has been found can be arbitrarily complex, and it is up to you to define it. I just tried a very simple heuristic: a threshold on the size of the matching blocks. With the threshold of 14 used above (size > 14), the grouper gives the following on the sample input:

In [15]: for key, group in it.groupby(paths, make_path_grouper()):
    ...:     group = tuple(group)
    ...:     if len(group) > 1:
    ...:         print('Got a block: {}\n'.format(key))
    ...:         for path in group:
    ...:             print(path)
    ...:         print('End block')
    ...:     else:
    ...:         print('Got lonely path:\n')
    ...:         print(key)
Got lonely path:
/project/a/b/static
Got lonely path:
/project/a/b/a/static
Got a block: /project/a/b(1)/static
/project/a/b(1)/static
/project/a/b(1)/linked
/project/a/b(1)/flat
End block
Got lonely path:
/project/c/c
Got a block: /project/a/b(2)/static
/project/a/b(2)/static
/project/a/b(2)/linked
/project/a/b(2)/flat
End block
Got a block: /project/a/b(3)/static
/project/a/b(3)/static
/project/a/b(3)/linked
/project/a/b(3)/unique
/project/a/b(3)/flat
End block
Got a block: /project/a/b(4)/static
/project/a/b(4)/static
/project/a/b(4)/linked
/project/a/b(4)/flat
End block
Got lonely path:
/project/a/a/a/a/a/linked

This is the same grouping as the previous code produces, but the fixed threshold would obviously give wrong results on other inputs.

Extending the solution a bit: since you only want to match a certain prefix, you could do:
def make_path_grouper(prefix_checker):
    matcher = difflib.SequenceMatcher()

    def path_grouper(path):
        if not matcher.a:
            matcher.set_seq1(path)
            return path
        else:
            matcher.set_seq2(path)
            first = matcher.get_matching_blocks()[0]
            # arbitrary code to decide whether we have a match;
            # the first block is only a common prefix if it starts
            # at index 0 in both paths
            if (first.a == 0 and first.b == 0
                    and prefix_checker(matcher.a[:first.size])):
                # we have a match
                return matcher.a
            else:
                # no match. The new path supersedes old "a"
                matcher.set_seq1(path)
                return path

    return path_grouper

You can then define a custom prefix_checker to decide whether a matched prefix should be considered a group. Some examples:

def prefix_length_checker(length):
    """Consider group if the prefix is of at least the given length."""
    return lambda x: len(x) >= length

def prefix_regex_checker(regex):
    """Consider group if the prefix matches a certain regex."""
    return regex.match

def prefix_ratio_checker(pattern, threshold):
    """Consider group if the prefix is "similar" to a given pattern.

    This uses difflib's similarity ratio, which is related to, but not
    the same as, Levenshtein distance.
    """
    matcher = difflib.SequenceMatcher()
    matcher.set_seq1(pattern)

    def check_ratio(prefix, matcher=matcher):
        matcher.set_seq2(prefix)
        return matcher.ratio() >= threshold

    return check_ratio

And use them like this:

grouper = make_path_grouper(prefix_ratio_checker('/project/a/b/', 0.8))
for key, group in it.groupby(paths, grouper):
    ...  # process each group as in the loops above

In your case, the regex checker should be enough, since you only want to match prefixes that contain the parenthesised integer part, as in b(1).
This could be extended to check not only the prefix but all of the matching blocks, but checking the prefix should be enough for your use case.
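
As a rough end-to-end sketch of the regex variant, the following collects the question's "chunks" into a list and drops the outliers. The pattern CHUNK_PREFIX and the helper find_chunks are hypothetical names chosen for this illustration, and the grouper is a condensed version of the one above, so treat it as one possible arrangement rather than the definitive one.

```python
import difflib
import itertools as it
import re

paths = [
    "/project/a/b/static",
    "/project/a/b/a/static",
    "/project/a/b(1)/static",
    "/project/a/b(1)/linked",
    "/project/a/b(1)/flat",
    "/project/c/c",
    "/project/a/b(2)/static",
    "/project/a/b(2)/linked",
    "/project/a/b(2)/flat",
    "/project/a/b(3)/static",
    "/project/a/b(3)/linked",
    "/project/a/b(3)/unique",
    "/project/a/b(3)/flat",
    "/project/a/b(4)/static",
    "/project/a/b(4)/linked",
    "/project/a/b(4)/flat",
    "/project/a/a/a/a/a/linked",
]

# Assumed pattern: a chunk's common prefix must end in a numbered b(N)/ folder.
CHUNK_PREFIX = re.compile(r"/project/a/b\(\d+\)/")


def make_path_grouper(prefix_checker):
    """Stateful key function for groupby; relies on paths arriving in order."""
    matcher = difflib.SequenceMatcher()

    def path_grouper(path):
        if matcher.a:
            matcher.set_seq2(path)
            first = matcher.get_matching_blocks()[0]
            # Only treat the first block as a prefix if it starts both paths.
            if first.a == 0 and first.b == 0 and prefix_checker(path[:first.size]):
                return matcher.a
        # First path, or no acceptable common prefix: start a new group.
        matcher.set_seq1(path)
        return path

    return path_grouper


def find_chunks(paths):
    """Return runs of two or more consecutive paths that share a chunk prefix."""
    grouper = make_path_grouper(CHUNK_PREFIX.match)
    groups = (list(group) for _, group in it.groupby(paths, grouper))
    return [group for group in groups if len(group) > 1]


chunks = find_chunks(paths)
for chunk in chunks:
    print(chunk)
```

Single-path groups are treated as outliers here, which fits the sample list but would also discard a genuine chunk that happens to contain only one folder.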