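A more elegant way to implement regexp-like quantifiers

Date: 2017-01-22 11:51:52

Tags: python regex itertools

I am writing a simple string parser that allows regexp-like quantifiers. An input string might look like this: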

s = "x y{1,2} z"

My parser function converts this string into a list of tuples:

list_of_tuples = [("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]
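The parser itself is not shown in the question. As a rough sketch of the step described above, assuming tokens are separated by whitespace and a quantifier is written as {n} or {m,n} directly after the token, it could look like this (the names TOKEN and parse_quantified are hypothetical, not from the original post):

```python
import re

# Hypothetical token pattern: a run of characters that are neither
# whitespace nor braces, optionally followed by {n} or {m,n}.
TOKEN = re.compile(r"([^\s{}]+)(?:\{(\d+)(?:,(\d+))?\})?")

def parse_quantified(s):
    """Turn 'x y{1,2} z' into [(value, min_count, max_count), ...]."""
    result = []
    for token in s.split():
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError("cannot parse token: %r" % token)
        value, low, high = match.groups()
        if low is None:        # no quantifier: exactly once
            result.append((value, 1, 1))
        elif high is None:     # {n}: exactly n times
            result.append((value, int(low), int(low)))
        else:                  # {m,n}: between m and n times
            result.append((value, int(low), int(high)))
    return result

print(parse_quantified("x y{1,2} z"))
```

This sketch rejects anything it does not recognise instead of guessing, and it does not handle whitespace inside the braces.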

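Now, the tricky bit is that I need a list of all valid combinations specified by the quantification. The combinations must all have the same number of elements, with the value None used for padding. For the given example, the expected output is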

[["x", "y", None, "z"], ["x", "y", "y", "z"]]

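I do have a working solution, but I am not really happy with it: it uses two nested for loops, and I find the code somewhat obscure, so overall it feels awkward and clumsy: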

import itertools

def permute_input(lot):
    outer = []
    # is there something that replaces these nested loops?
    for val, start, end in lot:
        inner = []
        # For each tuple, create a list of constant length
        # Each element contains a different number of 
        # repetitions of the value of the tuple, padded
        # by the value None if needed.
        for i in range(start, end + 1):
            x = [val] * i + [None] * (end - i)
            inner.append(x)
        outer.append(inner)
    # Outer is now a list of lists.

    final = []
    # use itertools.product to combine the elements in the
    # list of lists:
    for combination in itertools.product(*outer):
        # flatten the elements in the current combination,
        # and append them to the final list:
        final.append([x for x 
                    in itertools.chain.from_iterable(combination)])
    return final

print(permute_input([("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]))
[['x', 'y', None, 'z'], ['x', 'y', 'y', 'z']]

I suspect there is a more elegant way of doing this, perhaps hidden somewhere in the itertools module?

3 answers:

Answer 0 (score: 6)

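Another way to approach the problem is to use pyparsing and this example regex parser, which expands a regular expression into the strings it can match. For your x y{1,2} z sample string, it generates two possible strings by expanding the quantifier: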

$ python -i regex_invert.py 
>>> s = "x y{1,2} z"
>>> for item in invert(s):
...     print(item)
... 
x y z
x yy z

The repetition itself supports both open-ended and closed ranges and is defined as:

repetition = (
    (lbrace + Word(nums).setResultsName("count") + rbrace) |
    (lbrace + Word(nums).setResultsName("minCount") + "," + Word(nums).setResultsName("maxCount") + rbrace) |
    oneOf(list("*+?"))
)

To get the desired result, we should change the way results are yielded from the recurseList generator so that it returns lists instead of strings:

for s in elist[0].makeGenerator()():
    for s2 in recurseList(elist[1:]):
        yield [s] + [s2]  # instead of yield s + s2

Then we only need to flatten the result:

$ ipython3 -i regex_invert.py 

In [1]: import collections.abc

In [2]: def flatten(l):
   ...:     for el in l:
   ...:         if isinstance(el, collections.abc.Iterable) and not isinstance(el, (str, bytes)):
   ...:             yield from flatten(el)
   ...:         else:
   ...:             yield el
   ...:             

In [3]: s = "x y{1,2} z"

In [4]: for option in invert(s):
   ...:     print(list(flatten(option)))
   ...: 
['x', ' ', 'y', None, ' ', 'z']
['x', ' ', 'y', 'y', ' ', 'z']

Then, if needed, you can filter out the whitespace characters:

In [5]: for option in invert(s):
   ...:     print([item for item in flatten(option) if item != ' '])
   ...:     
['x', 'y', None, 'z']
['x', 'y', 'y', 'z']

Answer 1 (score: 2)

The part that generates the different lists from the tuples can be written with a list comprehension:

outer = []
for val, start, end in lot:
    # For each tuple, create a list of constant length
    # Each element contains a different number of 
    # repetitions of the value of the tuple, padded
    # by the value None if needed.
    outer.append([[val] * i + [None] * (end - i) for i in range(start, end + 1)])

(The whole thing could in turn be written as a single list comprehension, but that makes the code harder to read, IMHO.)
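For comparison, here is a sketch of what the fully collapsed form might look like. The name permute_input_compact is hypothetical, and the flattening uses itertools.chain.from_iterable as in the question's code:

```python
import itertools

def permute_input_compact(lot):
    # One padded list per allowed repetition count, per tuple; then the
    # Cartesian product of those lists, each combination flattened.
    return [
        list(itertools.chain.from_iterable(combination))
        for combination in itertools.product(*(
            [[val] * i + [None] * (end - i) for i in range(start, end + 1)]
            for val, start, end in lot
        ))
    ]

print(permute_input_compact([("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]))
```

It produces the same result as the loop version, but the nesting has to be read from the inside out, which is the readability cost mentioned above.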

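On the other hand, the list comprehension in [x for x in itertools.chain.from_iterable(combination)] can be written more concisely. The point is simply to build an actual list from an iterable, which can be done with list(itertools.chain.from_iterable(combination)). An alternative is to use the sum builtin. I am not sure which one is better.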
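A small sketch of the two flattening options side by side, using one combination of the kind itertools.product would yield for the example input (the variable names are only illustrative):

```python
import itertools

# One combination for the example input: a tuple of padded lists.
combination = (["x"], ["y", None], ["z"])

flat_chain = list(itertools.chain.from_iterable(combination))
flat_sum = sum(combination, [])  # the start value [] makes sum concatenate lists

print(flat_chain)
print(flat_sum)
```

Both give the same list. One difference worth knowing: sum builds a new intermediate list for every addition, so its cost grows quadratically with the number of sublists, while chain.from_iterable walks the elements once. For short inputs like this one the difference is negligible.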

Finally, the final.append part can be written as a list comprehension:

# use itertools.product to combine the elements in the list of lists:
# flatten the elements in the current combination,
return [sum(combination, []) for combination in itertools.product(*outer)]

The final code is just your code, slightly rewritten:

import itertools

def permute_input(lot):
    outer = []
    for val, start, end in lot:
        # For each tuple, create a list of constant length
        # Each element contains a different number of
        # repetitions of the value of the tuple, padded
        # by the value None if needed.
        outer.append([[val] * i + [None] * (end - i) for i in range(start, end + 1)])

    # use itertools.product to combine the elements in the list of lists,
    # and flatten the elements in each combination:
    return [sum(combination, []) for combination in itertools.product(*outer)]

Answer 2 (score: 2)

Recursive solution (simple, works for up to a few thousand tuples):
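The recursive code is not reproduced in this copy of the answer. The following is only a minimal sketch of how such a recursion could look, not the answerer's original code; the name expand_recursive is hypothetical:

```python
def expand_recursive(lot):
    """Yield every padded combination for a list of (value, start, end)."""
    if not lot:
        yield []  # nothing left to expand: one empty combination
        return
    value, start, end = lot[0]
    for count in range(start, end + 1):
        # 'count' repetitions of the value, padded with None up to 'end'
        head = [value] * count + [None] * (end - count)
        for tail in expand_recursive(lot[1:]):
            yield head + tail

print(list(expand_recursive([("x", 1, 1), ("y", 1, 2), ("z", 1, 1)])))
```

In this sketch each tuple adds one level of recursion, so the depth grows with the length of the input list.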

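It is limited by the recursion depth (~1000). If that is not enough, there is a simple optimization for the start == end cases. Depending on the expected size of list_of_tuples, that might be enough.

Test:

>>> list(permutations(list_of_tuples))  # list() because it's an iterator
[['x', 'y', None, 'z'], ['x', 'y', 'y', 'z']]

Without recursion (universal but less elegant):

def permutations(lot):
    source = []
    cnum = 1  # number of possible combinations
    for item, start, end in lot:  # create full list without Nones
        source += [item] * end  # each item at its maximum repetition count
        cnum *= (end-start+1)

    for i in range(cnum):
        bitmask = [True] * len(source)
        state = i
        pos = 0
        for _, start, end in lot:
            state, m = divmod(state, end-start+1)  # m - number of Nones to insert
            pos += end  # pos now points just past this item's slots
            bitmask[pos-m:pos] = [None] * m
        yield [bitmask[i] and c for i, c in enumerate(source)]

The idea behind this solution: we are effectively looking at the full string (xyyz) through a glass that overlays a certain number of None values. We can count the number of possible combinations by computing the product of all (end-start+1). Then we can number all the iterations (a simple range loop) and reconstruct the mask from the iteration number. Here the mask is rebuilt by repeatedly applying divmod to the state number and using each remainder as the number of Nones at that symbol's position.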
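To make the divmod step concrete, here is a small sketch that decodes an iteration number into the per-tuple count of Nones and nothing else. The helper name decode_state and the sample input are illustrative assumptions, not part of the answer:

```python
def decode_state(i, lot):
    """Split iteration number i into one None-count per tuple."""
    counts = []
    for _, start, end in lot:
        # Treat i as a mixed-radix number whose digit for this tuple
        # has radix (end - start + 1); the remainder is that digit.
        i, m = divmod(i, end - start + 1)
        counts.append(m)
    return counts

lot = [("a", 0, 1), ("b", 1, 3)]  # 2 * 3 = 6 combinations
for i in range(6):
    print(i, decode_state(i, lot))
```

Every iteration number from 0 to 5 maps to a distinct pair of counts, which is why a plain range loop is enough to enumerate all combinations.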