How can I cluster similar file paths?

Time: 2013-12-31 07:20:11

Tags: python

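In a directory, I have several folders. In the middle of a long list of folder paths, a semi-structured pattern often emerges in which the paths share a common parent folder. Only a fixed set of folder names is available; only the arrangement and length of each path are unique. Here is an example list: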

/project/a/b/static <--- not part of any chunk due to missing '(integer)' in /b/
/project/a/b/a/static <--- not part of any chunk

/project/a/b(1)/static
/project/a/b(1)/linked
/project/a/b(1)/flat

/project/c/c  <--- not part of any chunk

/project/a/b(2)/static
/project/a/b(2)/linked
/project/a/b(2)/flat

/project/a/b(3)/static
/project/a/b(3)/linked
/project/a/b(3)/unique <--- part of this chunk due to same parent folder names
/project/a/b(3)/flat

/project/a/b(4)/static
/project/a/b(4)/linked
/project/a/b(4)/flat

/project/a/a/a/a/a/linked <---- not part of any chunk

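Basically, what I want to do is find the "chunks" of similar folder paths, separated as shown above. The final result would be something like a list of "chunks", with the outliers removed.

I only have pseudocode in mind so far, but I definitely want to cluster similar paths based on string length and/or some kind of Levenshtein distance.

Or does it look like I need approximate string matching rather than clustering?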

1 answer:

Answer 0: (score: 4)

Assuming these file paths are in a list called paths, you can use itertools.groupby with a key function such as this one:

import re

REGEX = re.compile(r'/b\((\d+)\)/')  # change to match your needs

def path_grouper(path):
    match = REGEX.search(path)
    if match is None:
        return (False, path)
    return (True, int(match.group(1)))

The key function is used as follows, with itertools imported as it:

In [6]: for (is_group, key), group in it.groupby(paths, path_grouper):
   ...:     if is_group:
   ...:         print('Got a group with key: {}\n'.format(key))
   ...:         for path in group:
   ...:             print(path)
   ...:         print('End group')
   ...:     else:
   ...:         print('Got lonely path:\n')
   ...:         for path in group:
   ...:             print(path)
   ...:
Got lonely path:

/project/a/b/static
Got lonely path:

/project/a/b/a/static
Got a group with key: 1

/project/a/b(1)/static
/project/a/b(1)/linked
/project/a/b(1)/flat
End group
Got lonely path:

/project/c/c
Got a group with key: 2

/project/a/b(2)/static
/project/a/b(2)/linked
/project/a/b(2)/flat
End group
Got a group with key: 3

/project/a/b(3)/static
/project/a/b(3)/linked
/project/a/b(3)/unique
/project/a/b(3)/flat
End group
Got a group with key: 4

/project/a/b(4)/static
/project/a/b(4)/linked
/project/a/b(4)/flat
End group
Got lonely path:

/project/a/a/a/a/a/linked

Note that groupby only merges consecutive items, so this relies on the paths of a chunk being adjacent in the list.

If you want a more flexible approach, you can try the difflib standard module. In particular, you can use the SequenceMatcher methods find_longest_match() or get_matching_blocks() to see where two paths match and try to decide whether they should be grouped.
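To get a feel for what these two SequenceMatcher methods return, here is a minimal sketch on two paths taken from the question. The variable names are only illustrative.

```python
import difflib

a = "/project/a/b(1)/static"
b = "/project/a/b(1)/linked"
matcher = difflib.SequenceMatcher(None, a, b)

# Longest contiguous run shared by both strings, returned as Match(a, b, size).
longest = matcher.find_longest_match(0, len(a), 0, len(b))
common = a[longest.a:longest.a + longest.size]
print(longest.size, common)  # 16 /project/a/b(1)/

# All matching runs, in order; the last entry is always a zero-size sentinel.
for block in matcher.get_matching_blocks():
    print(block, repr(a[block.a:block.a + block.size]))
```

The shared run here is the 16-character common prefix, which is the kind of information a size threshold or a prefix check can act on.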

Example usage:

import difflib

def make_path_grouper():
    matcher = difflib.SequenceMatcher()
    def path_grouper(path):
        if not matcher.a:
            matcher.set_seq1(path)
            return path
        else:
            matcher.set_seq2(path)
            matchings = matcher.get_matching_blocks()
            # arbitrary code to decide whether we have a match
            if any(size > 14 for _,_,size in matchings):
                # we have a match
                return matcher.a
            else:
                # no match. The new path supersedes old "a"
                matcher.set_seq1(path)
                return path
    return path_grouper

Note that the logic that decides when a group has been found can be arbitrarily complex, and it is up to you to define it. I only tried a very simple heuristic: a threshold on the size of a matching block. With a threshold of size > 14 it finds the same blocks as the previous code on this input, but it will obviously be wrong for other inputs:

In [15]: for key, group in it.groupby(paths, make_path_grouper()):
    ...:     group = tuple(group)
    ...:     if len(group) > 1:
    ...:         print('Got a block: {}\n'.format(key))
    ...:         for path in group:
    ...:             print(path)
    ...:         print('End block')
    ...:     else:
    ...:         print('Got lonely path:\n')
    ...:         print(key)
Got lonely path:

/project/a/b/static
Got lonely path:

/project/a/b/a/static
Got a block: /project/a/b(1)/static

/project/a/b(1)/static
/project/a/b(1)/linked
/project/a/b(1)/flat
End block
Got lonely path:

/project/c/c
Got a block: /project/a/b(2)/static

/project/a/b(2)/static
/project/a/b(2)/linked
/project/a/b(2)/flat
End block
Got a block: /project/a/b(3)/static

/project/a/b(3)/static
/project/a/b(3)/linked
/project/a/b(3)/unique
/project/a/b(3)/flat
End block
Got a block: /project/a/b(4)/static

/project/a/b(4)/static
/project/a/b(4)/linked
/project/a/b(4)/flat
End block
Got lonely path:

/project/a/a/a/a/a/linked

Expanding the solution a bit: since you only want to match a certain prefix, you could do this:

def make_path_grouper(prefix_checker):
    matcher = difflib.SequenceMatcher()
    def path_grouper(path):
        if not matcher.a:
            matcher.set_seq1(path)
            return path
        else:
            matcher.set_seq2(path)
            matchings = tuple(matcher.get_matching_blocks())
            # arbitrary code to decide whether we have a match;
            # this treats the first block as the common prefix, which
            # assumes that block starts at index 0 of both paths
            if matchings and prefix_checker(matcher.a[:matchings[0][2]]):
                # we have a match
                return matcher.a
            else:
                # no match. The new path supersedes old "a"
                matcher.set_seq1(path)
                return path
    return path_grouper

You can then define a custom prefix_checker that decides whether the matched prefix should be considered a group. Some examples:

def prefix_length_checker(length):
    """Consider group if the prefix is of at least the given length."""
    return lambda x: len(x) >= length

def prefix_regex_checker(regex):
    """Consider group if the prefix matches a certain regex."""
    return regex.match

def prefix_ratio_checker(pattern, threshold):
    """Consider group if the prefix is "similar" to a given pattern.

    The similarity is difflib's SequenceMatcher.ratio(), which is based on
    matching blocks rather than on Levenshtein edit distance.
    """
    matcher = difflib.SequenceMatcher()
    matcher.set_seq1(pattern)
    def check_ratio(prefix, matcher=matcher):
        matcher.set_seq2(prefix)
        return matcher.ratio() >= threshold
    return check_ratio

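The checkers are then used like this:

grouper = make_path_grouper(prefix_ratio_checker('/project/a/b/', 0.8))
for key, group in it.groupby(paths, grouper):
    ...  # process each group as before

In your case the regex-based checker is probably enough, since you only want to match prefixes that contain the '(integer)' part.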
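As a rough illustration of the regex-based check, here is a self-contained sketch. The pattern CHUNK_PREFIX and the shortened paths list are assumptions made for this example, and the grouper is a condensed restatement of the prefix-checking grouper with one extra guard: the first matching block must really start at index 0 of both paths.

```python
import difflib
import itertools as it
import re

# Hypothetical pattern: a chunk prefix must contain the "b(<integer>)/" folder.
CHUNK_PREFIX = re.compile(r'/project/a/b\(\d+\)/')

def make_path_grouper(prefix_checker):
    matcher = difflib.SequenceMatcher()
    def path_grouper(path):
        if matcher.a:
            matcher.set_seq2(path)
            first = matcher.get_matching_blocks()[0]
            # Extra guard (not in the answer): only a block that starts at
            # index 0 of both strings is a true common prefix.
            if first.a == 0 and first.b == 0 and prefix_checker(path[:first.size]):
                return matcher.a
        # First path, or no match: this path becomes the new reference.
        matcher.set_seq1(path)
        return path
    return path_grouper

# A subset of the paths from the question.
paths = [
    '/project/a/b/static',
    '/project/a/b/a/static',
    '/project/a/b(1)/static',
    '/project/a/b(1)/linked',
    '/project/a/b(1)/flat',
    '/project/c/c',
    '/project/a/b(2)/static',
    '/project/a/b(2)/linked',
    '/project/a/b(2)/flat',
    '/project/a/a/a/a/a/linked',
]

summary = []
for key, group in it.groupby(paths, make_path_grouper(CHUNK_PREFIX.match)):
    group = list(group)
    if len(group) > 1:
        summary.append((key, len(group)))

print(summary)
```

On this input the two b(1) and b(2) runs are reported with three paths each, and every other path is left on its own.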

This could be extended to check not only the prefix but all matching blocks, but checking the prefix should be enough for your use case.
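Coming back to the goal stated in the question, a list of chunks with the outliers removed, one possible sketch is to collect the runs produced by groupby and drop the short ones. The helper find_chunks and its min_size parameter are hypothetical names introduced here. The key function is the regex-based one from the start of this answer, and any of the difflib-based groupers could be passed instead.

```python
import itertools as it
import re

REGEX = re.compile(r'/b\((\d+)\)/')

def path_grouper(path):
    match = REGEX.search(path)
    if match is None:
        return (False, path)
    return (True, int(match.group(1)))

def find_chunks(paths, key_func, min_size=2):
    """Return runs of consecutive paths sharing a key, dropping short runs."""
    runs = (list(group) for _, group in it.groupby(paths, key_func))
    return [run for run in runs if len(run) >= min_size]

# A subset of the paths from the question.
paths = [
    '/project/a/b/static',
    '/project/a/b(1)/static',
    '/project/a/b(1)/linked',
    '/project/a/b(1)/flat',
    '/project/c/c',
    '/project/a/b(3)/static',
    '/project/a/b(3)/linked',
    '/project/a/b(3)/unique',
    '/project/a/b(3)/flat',
    '/project/a/a/a/a/a/linked',
]

chunks = find_chunks(paths, path_grouper)
for chunk in chunks:
    print(chunk)
```

Dropping runs shorter than min_size is only a heuristic for "outlier": a b(n) folder that contributes a single path would also be discarded.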