Question

Suppose I have these two sets of strings:

file=sheet-2016-12-08.xlsx
file=sheet-2016-11-21.xlsx
file=sheet-2016-11-12.xlsx
file=sheet-2016-11-08.xlsx
file=sheet-2016-10-22.xlsx
file=sheet-2016-09-29.xlsx
file=sheet-2016-09-05.xlsx
file=sheet-2016-09-04.xlsx

size=1024KB
size=22KB
size=980KB
size=15019KB
size=202KB

I need to run a function on each of these two sets separately and get the following output for each:

file=sheet-2016-*.xlsx

size=*KB

The data set can be any set of strings; it does not have to match a particular format. Here is another example:

id.4030.paid
id.1280.paid
id.88.paid

The expected output is:

id.*.paid

Basically, I need a function that analyzes a set of strings and replaces the substring they do not have in common with an asterisk (*).

Answer 1

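You can use os.path.commonprefix to compute the common prefix. It is meant for finding the shared directory in a list of file paths, but it compares character by character, so it works in a generic context too.

Then reverse the strings, apply commonprefix again, and reverse the result to get the common suffix (adapted from https://gist.github.com/willwest/ca5d050fdf15232a9e67):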

dataset = """id.4030.paid
id.1280.paid
id.88.paid""".splitlines()

import os


# Return the longest common suffix in a list of strings
def longest_common_suffix(list_of_strings):
    reversed_strings = [s[::-1] for s in list_of_strings]
    return os.path.commonprefix(reversed_strings)[::-1]

common_prefix = os.path.commonprefix(dataset)
common_suffix = longest_common_suffix(dataset)

print("{}*{}".format(common_prefix,common_suffix))

Result:

id.*.paid

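Edit: as wim pointed out:

When all strings are equal, the common prefix & suffix are the same, but the function should return the string itself rather than prefix*suffix: check whether all strings are identical.
When the common prefix & suffix overlap (share letters), the computation also goes wrong: the common suffix should be computed on the strings with the common prefix removed.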
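
To make the overlap problem concrete, here is a small sketch. The helper name naive_pattern is made up for illustration; it applies the prefix/suffix approach without any correction and then uses fnmatch.fnmatchcase to check whether the resulting pattern still matches every input:

```python
import os
from fnmatch import fnmatchcase


def naive_pattern(strings):
    # Hypothetical helper: prefix and suffix are both taken from the
    # full strings, so they are allowed to overlap.
    prefix = os.path.commonprefix(strings)
    suffix = os.path.commonprefix([s[::-1] for s in strings])[::-1]
    return "{}*{}".format(prefix, suffix)


strings = ["aBBc", "aBc"]
pattern = naive_pattern(strings)
print(pattern)                                      # aB*Bc
print([fnmatchcase(s, pattern) for s in strings])   # [True, False]
```

The shared "B" is counted once in the prefix and once in the suffix, so the pattern ends up longer than the shortest input and no longer matches it.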

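So an all-in-one function is needed that first tests the list to make sure at least 2 strings differ (condensing the prefix/suffix formula along the way), and that slices off the common prefix before computing the common suffix: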

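def compute_generic_string(dataset):
    # edge case where all strings are the same
    if len(set(dataset)) == 1:
        return dataset[0]

    commonprefix = os.path.commonprefix(dataset)
    # drop the common prefix first so that prefix and suffix cannot overlap
    remainders = [s[len(commonprefix):] for s in dataset]
    commonsuffix = os.path.commonprefix([s[::-1] for s in remainders])[::-1]

    return "{}*{}".format(commonprefix, commonsuffix)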

Now let's test it:

for dataset in [['id.4030.paid','id.1280.paid','id.88.paid'],['aBBc', 'aBc'],[]]:
    print(compute_generic_string(dataset))

Result:

id.*.paid
aB*c
*

(When the dataset is empty, the code returns *; perhaps that should be handled as another edge case.)
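
One possible way to treat the empty dataset as its own edge case is to reject it outright. This is only a sketch: the name compute_generic_string_strict and the choice of raising ValueError are assumptions, and returning None or an empty string would be just as defensible depending on the caller.

```python
import os


def compute_generic_string_strict(dataset):
    # Hypothetical variant: refuse empty input instead of returning "*".
    dataset = list(dataset)
    if not dataset:
        raise ValueError("dataset must contain at least one string")

    # edge case where all strings are the same
    if len(set(dataset)) == 1:
        return dataset[0]

    prefix = os.path.commonprefix(dataset)
    # compute the suffix on what is left after removing the prefix
    remainders = [s[len(prefix):] for s in dataset]
    suffix = os.path.commonprefix([s[::-1] for s in remainders])[::-1]
    return "{}*{}".format(prefix, suffix)


print(compute_generic_string_strict(["id.4030.paid", "id.1280.paid", "id.88.paid"]))  # id.*.paid
```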

Answer 2

from os.path import commonprefix

def commonsuffix(m):
    return commonprefix([s[::-1] for s in m])[::-1]

def inverse_glob(strs):
    start = commonprefix(strs)
    n = len(start)
    ends = [s[n:] for s in strs]
    end = commonsuffix(ends)
    if start and not any(ends):
        return start
    else:
        return start + '*' + end

This problem is trickier than it looks at face value.

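As currently specified, the problem is still underconstrained, i.e. there is no unique solution. For the input ['spamAndEggs', 'spamAndHamAndEggs'], both spam*AndEggs and spamAnd*Eggs are valid answers. For the input ['aXXXXz', 'aXXXz'], there are four possible solutions. In the code given above, we choose the longest possible prefix in order to make the solution unique.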
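
The ambiguity can be made visible with a short sketch. The function candidate_patterns is a made-up name, and it rests on an assumption about what counts as a solution: a single-wildcard pattern that matches every input and keeps as many literal characters as possible. Under that assumption it lists every way of splitting the literal characters between prefix and suffix:

```python
from os.path import commonprefix


def candidate_patterns(strs):
    # Hypothetical helper: enumerate single-"*" patterns that keep the
    # maximum number of literal characters. Assumes strs is non-empty
    # and that the strings are not all identical.
    prefix = commonprefix(strs)
    suffix = commonprefix([s[::-1] for s in strs])[::-1]
    shortest = min(len(s) for s in strs)
    # Prefix and suffix together cannot be longer than the shortest string.
    best = min(shortest, len(prefix) + len(suffix))
    patterns = []
    for i in range(len(prefix) + 1):
        j = best - i  # literal characters taken from the suffix
        if 0 <= j <= len(suffix):
            patterns.append(prefix[:i] + "*" + suffix[len(suffix) - j:])
    return patterns


print(candidate_patterns(["aXXXXz", "aXXXz"]))
# ['a*XXXz', 'aX*XXz', 'aXX*Xz', 'aXXX*z']
```

The last entry is the one with the longest prefix, which is the choice inverse_glob makes. For ['spamAndEggs', 'spamAndHamAndEggs'] the list contains both patterns mentioned above, along with other splits of the same total length.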

Credit to JFF's answer for pointing out the existence of os.path.commonprefix.

"Inverse glob - reverse engineer a wildcard string from file names" is a related and more difficult generalization of this problem.
