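Most Pythonic way to split an array by repeating elements

Time: 2011-12-16 05:33:47

Tags: python

I have a list of items that I want to split based on a delimiter. I want all delimiters to be removed, and the list to be split wherever the delimiter occurs twice in a row. For example, if the delimiter is 'X', then the following list: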

['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']

would turn into:

[['a', 'b'], ['c', 'd'], ['f', 'g']]

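Note that the last group is not split.

I've written some ugly code that does this, but I'm sure there is something nicer. Extra points if you can set an arbitrary-length delimiter (i.e. split the list after seeing N delimiters).

11 answers:

Answer 0 (score: 13)

I don't think there's going to be a nice, elegant solution to this (I'd love to be proven wrong, of course), so I would suggest something straightforward: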

def nSplit(lst, delim, count=2):
    output = [[]]
    delimCount = 0
    for item in lst:
        if item == delim:
            delimCount += 1
        elif delimCount >= count:
            output.append([item])
            delimCount = 0
        else:
            output[-1].append(item)
            delimCount = 0
    return output

>>> nSplit(['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g'], 'X', 2)
[['a', 'b'], ['c', 'd'], ['f', 'g']]
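
A few edge cases are worth checking before relying on this. The sketch below repeats nSplit unchanged so that it runs on its own, then probes a longer count and leading/trailing delimiter runs. The extra sample inputs are my own and do not come from the question.

```python
def nSplit(lst, delim, count=2):
    output = [[]]
    delimCount = 0
    for item in lst:
        if item == delim:
            delimCount += 1
        elif delimCount >= count:
            output.append([item])
            delimCount = 0
        else:
            output[-1].append(item)
            delimCount = 0
    return output

data = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']

# No run of three delimiters exists, so nothing is split.
by_three = nSplit(data, 'X', 3)

# A leading run leaves an empty first group.
leading = nSplit(['X', 'X', 'a'], 'X', 2)

# A trailing run is dropped silently: no empty group is appended.
trailing = nSplit(['a', 'X', 'X'], 'X', 2)
```

Whether an empty group for a leading or trailing run is wanted depends on the use case; the question does not say.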

Answer 1 (score: 4)

Here is a way to do it using itertools.groupby():

import itertools

class MultiDelimiterKeyCallable(object):
    def __init__(self, delimiter, num_wanted=1):
        self.delimiter = delimiter
        self.num_wanted = num_wanted

        self.num_found = 0

    def __call__(self, value):
        if value == self.delimiter:
            self.num_found += 1
            if self.num_found >= self.num_wanted:
                self.num_found = 0
                return True
        else:
            self.num_found = 0

def split_multi_delimiter(items, delimiter, num_wanted):
    keyfunc = MultiDelimiterKeyCallable(delimiter, num_wanted)

    return (list(item
                 for item in group
                 if item != delimiter)
            for key, group in itertools.groupby(items, keyfunc)
            if not key)

items = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']

print(list(split_multi_delimiter(items, "X", 2)))

I must say that cobbal's solution is much simpler for the same result.
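
To see why a stateful key function works with groupby, it can help to look at the raw groups before any filtering. This is a minimal sketch that uses a closure instead of a class; make_key is a hypothetical name, and it returns False explicitly where the class above returns None. It relies on groupby calling the key function once per item, in order.

```python
import itertools

def make_key(delimiter, num_wanted):
    # Hypothetical helper: counts consecutive delimiters and flags
    # the one that completes a run of num_wanted.
    state = {"found": 0}

    def key(value):
        if value == delimiter:
            state["found"] += 1
            if state["found"] >= num_wanted:
                state["found"] = 0
                return True
        else:
            state["found"] = 0
        return False

    return key

items = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']
groups = [(k, list(g)) for k, g in itertools.groupby(items, make_key('X', 2))]
```

Only the delimiter that completes a run gets the key True, so every other item, stray delimiters included, ends up in a False group that is then filtered.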

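Answer 2 (score: 4)

Use a generator function to maintain the state of your iterator through the list, along with a count of the number of delimiters seen so far: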

l = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g'] 

def splitOn(ll, x, n):
    cur = []
    splitcount = 0
    for c in ll:
        if c == x:
            splitcount += 1
            if splitcount == n:
                yield cur
                cur = []
                splitcount = 0
        else:
            cur.append(c)
            splitcount = 0
    yield cur

print(list(splitOn(l, 'X', 2)))
print(list(splitOn(l, 'X', 1)))
print(list(splitOn(l, 'X', 3)))

l += ['X','X']
print(list(splitOn(l, 'X', 2)))
print(list(splitOn(l, 'X', 1)))
print(list(splitOn(l, 'X', 3)))

Prints:

[['a', 'b'], ['c', 'd'], ['f', 'g']]
[['a', 'b'], [], ['c', 'd'], [], ['f'], ['g']]
[['a', 'b', 'c', 'd', 'f', 'g']]
[['a', 'b'], ['c', 'd'], ['f', 'g'], []]
[['a', 'b'], [], ['c', 'd'], [], ['f'], ['g'], [], []]
[['a', 'b', 'c', 'd', 'f', 'g']]

EDIT: I'm also a big fan of groupby, so here's my go at it:

from itertools import groupby
def splitOn(ll, x, n):
    cur = []
    for isdelim,grp in groupby(ll, key=lambda c:c==x):
        if isdelim:
            nn = sum(1 for c in grp)
            while nn >= n:
                yield cur
                cur = []
                nn -= n
        else:
            cur.extend(grp)
    yield cur

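This is not much different from my earlier answer; it just lets groupby take care of iterating over the input list and creating groups of delimiter-matching and non-delimiter-matching items. The non-matching items are simply added onto the current element, and the matching groups do the work of breaking off new elements. For long lists this may be somewhat more efficient, since groupby does its work in C and still iterates over the list only once.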
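
As a quick illustration of what groupby hands to that loop, the sketch below lists each run together with its length. The variable name runs is mine.

```python
from itertools import groupby

l = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']

# Each entry is (is this a run of delimiters?, length of the run).
runs = [(isdelim, len(list(grp)))
        for isdelim, grp in groupby(l, key=lambda c: c == 'X')]
```

A delimiter run of length nn triggers nn // n splits in the loop above, which is why the single 'X' between 'f' and 'g' does nothing when n is 2.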

Answer 3 (score: 3)

a = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']
# Only works when every item is a single-character string.
b = [[c for c in q if c != 'X'] for q in "".join(a).split('X' * 2)]

This gives

[['a', 'b'], ['c', 'd'], ['f', 'g']]

where 2 is the number of elements you want. There is most likely a better way to do this.
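
The same idea can be wrapped in a function that takes the repeat count as a parameter. This is only a sketch: split_via_join is a hypothetical name, and it assumes every item, the delimiter included, is a one-character string. The second call shows what goes wrong when that assumption fails.

```python
def split_via_join(items, delim, n=2):
    # Assumption: every item (and delim) is a single character.
    joined = "".join(items)
    return [[ch for ch in part if ch != delim]
            for part in joined.split(delim * n)]

a = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']
ok = split_via_join(a, 'X', 2)

# With a multi-character item the boundaries are lost:
# 'ab' comes back as two separate items.
broken = split_via_join(['ab', 'X', 'X', 'c'], 'X', 2)
```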

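Answer 4 (score: 2)

Very ugly, but I wanted to see whether I could pull this off as a one-liner, and I thought I'd share. I beg you not to actually use this solution for anything of importance. The ('X', 2) at the end is the delimiter and the number of times it should be repeated. (It is written for Python 2, where reduce is a builtin and map and filter return lists.)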

(lambda delim, count: map(lambda x:filter(lambda y:y != delim, x), reduce(lambda x, y: (x[-1].append(y) if y != delim or x[-1][-count+1:] != [y]*(count-1) else x.append([])) or x, ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g'], [[]])))('X', 2)

Edit

Here's a breakdown. I also removed some redundant code that became much more obvious when written out like this. (Also changed above.)

# Wrap everything in a lambda form to avoid repeating values
(lambda delim, count:
    # Filter the delimiters out of all sublists after construction
    map(lambda x: filter(lambda y: y != delim, x), reduce(
        lambda x, y: (
            # Add the value to the current sublist
            x[-1].append(y) if
                # unless it is the delimiter that completes
                # a run of the specified length
                y != delim or x[-1][-count+1:] != [y]*(count-1) else

                # in which case start a new sublist
                x.append([])
        # append() returns None, so fall through to the accumulator
        ) or x,
        ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g'], [[]])
    )
)('X', 2)
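
For readers on Python 3, the same fold can be sketched as a named function. This is my own restatement of the idea rather than the author's code: split_reduce and step are hypothetical names, reduce is imported from functools, and the count == 1 case is handled explicitly because the slice in the one-liner does not cover it.

```python
from functools import reduce

def split_reduce(items, delim, count):
    def step(acc, y):
        # The last count-1 items of the current sublist.
        tail = acc[-1][-(count - 1):] if count > 1 else []
        if y == delim and tail == [delim] * (count - 1):
            # This delimiter completes a run: start a new sublist.
            acc.append([])
        else:
            # Everything else (stray delimiters too) is buffered for now.
            acc[-1].append(y)
        return acc

    chunks = reduce(step, items, [[]])
    # Drop the delimiters that were buffered along the way.
    return [[y for y in chunk if y != delim] for chunk in chunks]

data = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']
by_two = split_reduce(data, 'X', 2)
by_one = split_reduce(data, 'X', 1)
by_three = split_reduce(data, 'X', 3)
```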

Answer 5 (score: 1)

Here's a clean, pretty solution using zip and generators:

# 1. Define a traditional sequence split function.
# If you only want it for lists, you can use indexing to make it shorter.
def split(it, x):
    to_yield = []
    for y in it:
        if x == y:
            yield to_yield
            to_yield = []
        else:
            to_yield.append(y)
    if to_yield:
        yield to_yield

l = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']

# 2. Zip the sequence with its tail.
# You could use itertools.chain to avoid creating unnecessary lists.
zipped = zip(l, l[1:] + [''])

# 3. Remove every ('X', not 'X') pair from the resulting sequence,
# and keep only the first element of each remaining pair.
# You can use a list comprehension instead of a generator expression.
filtered = (x for x, y in zipped if not (x == 'X' and y != 'X'))

# 4. Split the result using the traditional split.
result = [x for x in split(filtered, 'X')]

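This way split() is more reusable.

It's surprising that Python doesn't have one built in.

EDIT:

You can easily adapt it to longer delimiter runs by repeating steps 2-3 and zipping the filtered result with l[i:] for 0 < i <= n.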
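
One way to read that suggestion is sketched below. It is my interpretation, not the author's code: each pass of steps 2-3 removes the last delimiter of every run, so applying it n - 1 times reduces a run of n delimiters to a single one that split() can then act on. The names drop_last_of_each_run and split_on_run are hypothetical.

```python
def split(it, x):
    # Same traditional split as in the answer.
    to_yield = []
    for y in it:
        if x == y:
            yield to_yield
            to_yield = []
        else:
            to_yield.append(y)
    if to_yield:
        yield to_yield

def drop_last_of_each_run(seq, delim):
    # Steps 2-3: keep an item unless it is a delimiter
    # whose successor is not a delimiter.
    end = object()  # padding that never equals delim
    return [x for x, y in zip(seq, seq[1:] + [end])
            if not (x == delim and y != delim)]

def split_on_run(seq, delim, n):
    seq = list(seq)
    for _ in range(n - 1):
        seq = drop_last_of_each_run(seq, delim)
    return list(split(seq, delim))

l = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']
by_two = split_on_run(l, 'X', 2)
by_three = split_on_run(l, 'X', 3)
by_one = split_on_run(l, 'X', 1)
```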

Answer 6 (score: 1)

import re
lst = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']
list(map(list, re.sub('(?<=[a-z])X(?=[a-z])', '', ''.join(lst)).split('XX')))

This does a list -> string -> list conversion and assumes that the non-delimiter items are all lowercase letters.
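
Spelled out step by step, the conversion looks roughly like this. The intermediate variable names are mine, and the same assumption applies: single lowercase letters as items and 'X' as the delimiter.

```python
import re

lst = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']

# list -> string
joined = ''.join(lst)

# Remove any single 'X' that sits directly between two lowercase letters.
cleaned = re.sub('(?<=[a-z])X(?=[a-z])', '', joined)

# Split on the doubled delimiter, then string -> list.
parts = cleaned.split('XX')
result = [list(part) for part in parts]
```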

Answer 7 (score: 0)

Here's another way to do it:

def split_multi_delimiter(items, delimiter, num_wanted):
    def remove_delimiter(objs):
        return [obj for obj in objs if obj != delimiter]

    ranges = [(index, index+num_wanted) for index in range(len(items))
              if items[index:index+num_wanted] == [delimiter] * num_wanted]

    last_end = 0
    for range_start, range_end in ranges:
        yield remove_delimiter(items[last_end:range_start])
        last_end = range_end

    yield remove_delimiter(items[last_end:])

items = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']
print(list(split_multi_delimiter(items, "X", 2)))
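
The index windows this approach relies on can be inspected on their own. In the sketch below, delimiter_windows is a hypothetical helper that repeats only the ranges computation; the second input is my own and shows that a run longer than num_wanted produces overlapping windows.

```python
def delimiter_windows(items, delimiter, num_wanted):
    # (start, end) of every slice that consists only of delimiters.
    return [(i, i + num_wanted) for i in range(len(items))
            if items[i:i + num_wanted] == [delimiter] * num_wanted]

items = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']
plain = delimiter_windows(items, 'X', 2)

# Three delimiters in a row match at two adjacent start positions.
overlapping = delimiter_windows(['a', 'X', 'X', 'X', 'b'], 'X', 2)
```

With overlapping windows, the generator above slices items[3:2] for the second window and therefore yields an empty group between the two real ones.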

Answer 8 (score: 0)

In [6]: input = ['a', 'b', 'X', 'X', 'cc', 'XX', 'd', 'X', 'ee', 'X', 'X', 'f']

In [7]: [s.strip('_').split('_') for s in '_'.join(input).split('X_X')]
Out[7]: [['a', 'b'], ['cc', 'XX', 'd', 'X', 'ee'], ['f']]

This assumes that you can use a reserved character such as _ that is not found in the input.
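
Note that the session above leaves the single 'X' in the middle group. A possible variant that also drops lone delimiters and takes the repeat count as a parameter is sketched here. split_with_separator is a hypothetical name; it assumes all items are non-empty strings that do not contain the reserved character.

```python
def split_with_separator(items, delim, n=2, sep='_'):
    # Assumption: sep never occurs inside any item.
    marker = sep.join([delim] * n)          # e.g. 'X_X' for n == 2
    chunks = sep.join(items).split(marker)
    # Drop empty fragments and stray delimiters from each chunk.
    return [[x for x in chunk.strip(sep).split(sep) if x and x != delim]
            for chunk in chunks]

simple = split_with_separator(
    ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g'], 'X')
mixed = split_with_separator(
    ['a', 'b', 'X', 'X', 'cc', 'XX', 'd', 'X', 'ee', 'X', 'X', 'f'], 'X')
```

An item that merely ends with the delimiter character, such as 'aX' followed by 'X', would still be mistaken for a run, so this remains a string trick rather than a general solution.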

Answer 9 (score: 0)

Too clever by half, and offered only because the obvious right way looks so brute-force and ugly:

import itertools

class joiner(object):
  def __init__(self, N, data = (), gluing = False):
    self.data = data
    self.N = N
    self.gluing = gluing
  def __add__(self, to_glue):
    # Process an item from itertools.groupby, by either
    # appending the data to the last item, starting a new item,
    # or changing the 'gluing' state according to the number of
    # consecutive delimiters that were found.
    N = self.N
    data = self.data
    item = list(to_glue[1])
    # A chunk of delimiters;
    # return a copy of self with the appropriate gluing state.
    if to_glue[0]: return joiner(N, data, len(item) < N)
    # Otherwise, handle the gluing appropriately, and reset gluing state.
    a, b = (data[:-1], data[-1] if data else []) if self.gluing else (data, [])
    return joiner(N, a + (b + item,))

def split_on_multiple(data, delimiter, N):
  # Split the list into alternating groups of delimiters and non-delimiters,
  # then use the joiner to join non-delimiter groups when the intervening
  # delimiter group is short.
  return sum(itertools.groupby(data, delimiter.__eq__), joiner(N)).data
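
The trick that makes this work is that sum() simply applies + from left to right, starting with the start value, so an object with a custom __add__ can act as an accumulator. A stripped-down sketch of just that mechanism, using a hypothetical Collector class unrelated to the splitting logic:

```python
class Collector(object):
    # Accumulates whatever is "added" to it into a tuple.
    def __init__(self, items=()):
        self.items = items

    def __add__(self, other):
        # Return a new Collector rather than mutating self.
        return Collector(self.items + (other,))

# sum() computes ((Collector() + 'a') + 'b') + 'c'.
total = sum(['a', 'b', 'c'], Collector())
```

In the joiner class the same fold runs over the (key, group) pairs produced by groupby, and each group is consumed inside __add__ before sum() advances to the next pair.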

Answer 10 (score: 0)

Regular expressions, I choose you!

import re

def split_multiple(delimiter, input):
    pattern = ''.join(map(lambda x: ',' if x == delimiter else ' ', input))
    filtered = iter(filter(lambda x: x != delimiter, input))
    result = []
    for k in map(len, re.split(';', ''.join(re.split(',',
        ';'.join(re.split(',{2,}', pattern)))))):
        result.append([])
        for n in range(k):
            result[-1].append(next(filtered))
    return result

print(split_multiple('X',
    ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']))

Oh, you said Python, not Perl.
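
For anyone trying to follow the nested re.split calls, the pipeline can be unrolled as in the sketch below. The step names are mine. The list is first encoded as a string with ',' for each delimiter and ' ' for every other item; runs of two or more commas become ';', leftover single commas are deleted, and the lengths of the remaining pieces say how many items go into each group.

```python
import re

items = ['a', 'b', 'X', 'X', 'c', 'd', 'X', 'X', 'f', 'X', 'g']

# One character per item: ',' marks a delimiter, ' ' anything else.
pattern = ''.join(',' if x == 'X' else ' ' for x in items)

step1 = re.split(',{2,}', pattern)       # cut at runs of 2+ delimiters
step2 = ';'.join(step1)                  # mark those cuts with ';'
step3 = ''.join(re.split(',', step2))    # delete lone delimiters
lengths = [len(part) for part in re.split(';', step3)]
```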