Question

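If I have a list of prefixes that can be attached to a string, how can I split a string into its prefixes and the remaining characters as the final substring? For example: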

prefixes = ['over','under','re','un','co']

str1 = "overachieve"
output: ["over","achieve"]

str2 = "reundo"
output = ["re","un","do"]

Is there a better way to do this, perhaps using regular expressions or some string functions, other than the following:

str1 = "reundo"
output = []

for x in [p for p in prefixes if p in str1]:
    output.append(x)    
    str1 =  str1.replace(x,"",1)
output.append(str1)

Answer 1

Regular expressions are an efficient way to search for many alternative prefixes:

import re

def split_prefixes(word, prefixes):
    regex = re.compile('|'.join(sorted(prefixes, key=len, reverse=True)))
    result = []
    i = 0
    while True:
        mo = regex.match(word, i)
        if mo is None:
            result.append(word[i:])
            return result
        result.append(mo.group())
        i = mo.end()


>>> prefixes = ['over', 'under', 're', 'un', 'co']
>>> for word in ['overachieve', 'reundo', 'empire', 'coprocessor']:
...     print(word, '-->', split_prefixes(word, prefixes))

overachieve --> ['over', 'achieve']
reundo --> ['re', 'un', 'do']
empire --> ['empire']
coprocessor --> ['co', 'processor']
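
The sort by descending length in the pattern matters because Python's `re` module tries alternatives from left to right and takes the first one that matches. A minimal sketch of the difference, assuming a hypothetical prefix list in which a short prefix is listed before a longer one that starts with it:

```python
import re

# Hypothetical prefix list: the short prefix comes before the longer one.
prefixes = ['un', 'under']

# Alternatives in list order versus longest-first order.
unsorted_regex = re.compile('|'.join(prefixes))
sorted_regex = re.compile('|'.join(sorted(prefixes, key=len, reverse=True)))

word = 'underachieve'
first_unsorted = unsorted_regex.match(word).group()
first_sorted = sorted_regex.match(word).group()

print(first_unsorted)  # un
print(first_sorted)    # under
```

With the unsorted pattern the split would continue from 'derachieve', which is probably not what is wanted.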

Answer 2

I would use the str.startswith method:

for p in prefixes:
    if str1.startswith(p):
        output.append(p)
        str1 = str1.replace(p, '', 1)
output.append(str1)

The biggest flaw in your code is that a string like 'found' produces ['un', 'fod'].
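
The difference comes from `in` being a substring test, while `startswith` only looks at the beginning of the string. A small sketch using the prefix list from the question (the variable names `contained` and `leading` are just for illustration):

```python
prefixes = ['over', 'under', 're', 'un', 'co']
word = 'found'

# Substring test: true for 'un' even though it sits in the middle of the word.
contained = [p for p in prefixes if p in word]

# Prefix test: only true when the word actually begins with p.
leading = [p for p in prefixes if word.startswith(p)]

print(contained)  # ['un']
print(leading)    # []
```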

However, if you have a hypothetical string like 'reuncoundo', you need to iterate over the list more than once.

while True:
    if not any(str1.startswith(i) for i in prefixes):
        output.append(str1)
        break
    for p in prefixes:
        if str1.startswith(p):
            output.append(p)
            str1 = str1.replace(p, '', 1)

This outputs ['re', 'un', 'co', 'un', 'do']

Answer 3

prefixes = ['over','under','re','un','co']

def test(string, prefixes, existing=None):
    prefixes = sorted(prefixes, key=len, reverse=True)  # search longer prefixes first, without mutating the caller's list
    if existing is None:
        existing = [] # deals with the fact that placing [] as a default parameter and modifying it modifies it for the entire session
    for prefix in prefixes:
        if string.startswith(prefix):
            existing.append(prefix)
            return test(string[len(prefix):], prefixes, existing)
    existing.append(string)
    return existing

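This code walks through the string recursively, stripping known prefixes until none are left, and then returns the whole list. On longer strings a generator might be the better route, but on shorter strings, avoiding the extra generator overhead might make this the better solution.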
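
For comparison, the generator route mentioned above could look roughly like the following. This is only a sketch, not part of the original answer: the name `iter_prefixes` is made up, empty prefixes are skipped as an added assumption to avoid looping forever, and a loop replaces the recursion.

```python
def iter_prefixes(string, prefixes):
    """Yield each leading prefix of string, then whatever remains."""
    # Longest prefixes first; empty prefixes are dropped (assumption).
    ordered = sorted((p for p in prefixes if p), key=len, reverse=True)
    while True:
        for prefix in ordered:
            if string.startswith(prefix):
                yield prefix
                string = string[len(prefix):]
                break  # rescan the shortened string from the longest prefix
        else:
            # No prefix matched: the rest is the final piece.
            yield string
            return


prefixes = ['over', 'under', 're', 'un', 'co']
print(list(iter_prefixes('reundo', prefixes)))  # ['re', 'un', 'do']
```

Because it loops instead of recursing, it does not depend on the recursion limit for strings with very many stacked prefixes.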

Answer 4

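Keeping the "two problems" adage in mind, I would still say this is a job for regular expressions. A regular expression is compiled once into a single matcher that tries all the alternatives inside the regex engine, rather than through a separate Python-level check for each prefix.

Here is an implementation that takes advantage of that: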

import re

def split_string(string, prefixes):
    regex = re.compile('|'.join(map(re.escape, prefixes))) # (1)
    while True:
        match = regex.match(string)
        if not match:
            break
        end = match.end()
        yield string[:end]
        string = string[end:]
    if string:
        yield string # (2)

prefixes = ['over','under','re','un','co']
assert (list(split_string('recouncoundo',prefixes))
        == ['re','co','un','co','un','do'])

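Note how the regular expression is constructed in (1):

The prefixes are escaped with re.escape so that special characters do not interfere
The escaped prefixes are joined with the | (or) regex operator
The whole thing is compiled.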
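
To see what the escaping step buys, here is a small sketch with hypothetical prefixes that contain regex metacharacters (these prefixes are not from the question):

```python
import re

# Hypothetical prefixes containing regex metacharacters.
special_prefixes = ['c++', 'a.b']

escaped = re.compile('|'.join(map(re.escape, special_prefixes)))

# After escaping, '+' and '.' are matched literally.
print(escaped.match('c++filt').group())  # c++
print(escaped.match('axb'))              # None

# Without escaping, '.' matches any character, so 'axb' is accepted.
raw = re.compile('a.b')
print(raw.match('axb').group())          # axb
```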

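If anything is left over after the prefixes have been split off, line (2) yields that final piece. If you want the function to yield an empty string when nothing remains after prefix stripping, you may want to remove the if string check.

Also note that re.match (as opposed to re.search) only looks for the pattern at the beginning of the input string, so there is no need to prepend ^ to the regex.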
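
A quick way to check the anchoring behaviour, sketched with a reduced two-prefix pattern chosen only for illustration:

```python
import re

regex = re.compile('un|re')

# match() only succeeds at position 0 of the string.
at_start = regex.match('sunrise')
# search() scans forward and finds 'un' inside the word.
anywhere = regex.search('sunrise')

print(at_start)                      # None
print(anywhere.group())              # un
print(regex.match('undo').group())   # un
```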

Answer 5

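If you are dealing with prefixes, you don't need regular expressions; you only need startswith(). You can of course use a regex, but it is harder to read and maintain, even for something this simple. In my opinion, startswith() is simpler.

The other answers seem too complicated for such a simple problem. I would suggest a recursive function like this: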

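def split_prefixes(word, prefixes):
    # All prefixes that the word starts with (e.g. both 'under' and 'un').
    matches = [p for p in prefixes if word.startswith(p)]
    if matches:
        # Strip only one prefix per step, preferring the longest match.
        longest = max(matches, key=len)
        return [longest] + split_prefixes(word[len(longest):], prefixes)
    else:
        return [word]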

The results are as follows:

"overachieve" -> ['over', 'achieve']
"reundo" -> ['re', 'un', 'do']
"reuncoundo" -> ['re', 'un', 'co', 'un', 'do']
"empire" -> ['empire']

Recursively split a string containing a set of defined prefixes - Python

5 answers: