Question

I have seen this: Split list into sublist based on index ranges

But my problem is slightly different. I have a list:

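I need to split it into sublists based on dates. Basically it is an event log, but because of poor database design, the system concatenates the separate update messages for an event into one big list of strings. I have:

My example would give:

Now I need to split the list into separate lists based on those indices. So for my example, ideally I would get: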

So the format is:

There are also some edge cases where there is no date string, and the format is:

Answer 1

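You don't need a two-pass approach at all: itertools.groupby can split the input by date and its associated events in a single pass. Because this avoids computing indices and then slicing the list with them, it also works on a generator that yields one value at a time, which avoids memory problems when the input is large. To demonstrate, I've taken the original List and extended it so that the edge-case handling is visible: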

import re
from itertools import groupby

List = ['undated', 'garbage', 'then', 'twodates', '2015-12-31',
        '2016-01-01', 'stuff happened', 'details', 
        '2016-01-02', 'more stuff happened', 'details', 'report',
        '2016-01-03']

datere = re.compile(r"\d+\-\d+\-\d+")  # Precompile regex for speed
def group_by_date(it):
    # Make an iterator that groups consecutive dates together and consecutive non-dates together
    grouped = groupby(it, key=lambda x: datere.match(x) is not None)
    for isdate, g in grouped:
        if not isdate:
            # We had a leading set of undated events, output as undated
            yield ['', list(g)]
        else:
            # At least one date found; iterate with one loop delay
            # so final date can have events included (all others have no events)
            lastdate = next(g)
            for date in g:
                yield [lastdate, []]
                lastdate = date

            # Final date pulls next group (which must be events or the end of the input)
            try:
                # Get next group of events
                events = list(next(grouped)[1])
            except StopIteration:
                # There were no events for final date
                yield [lastdate, []]
            else:
                # There were events associated with final date
                yield [lastdate, events]

print(list(group_by_date(List)))

Output (newlines added for readability):

[['', ['undated', 'garbage', 'then', 'twodates']],
 ['2015-12-31', []],
 ['2016-01-01', ['stuff happened', 'details']],
 ['2016-01-02', ['more stuff happened', 'details', 'report']],
 ['2016-01-03', []]]
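
For contrast, the index-then-slice approach that this answer avoids could look roughly like the sketch below. It is not part of the original answer: the names `DATE_RE`, `split_at_date_indices`, and `events` are made up for illustration, and the pattern is the same loose one used above. Note that it needs the whole input as an indexable list, which is the limitation the groupby version sidesteps.

```python
import re

DATE_RE = re.compile(r"\d+-\d+-\d+")  # same loose date pattern as above

def split_at_date_indices(items):
    """Two-pass split: find the date positions first, then slice between them."""
    # Pass 1: index of every entry that starts with a date-like string
    starts = [i for i, s in enumerate(items) if DATE_RE.match(s)]
    result = []
    # Anything before the first date has no date to attach to
    first = starts[0] if starts else len(items)
    if first > 0:
        result.append(['', items[:first]])
    # Pass 2: slice from each date up to the next date (or the end of the list)
    for start, end in zip(starts, starts[1:] + [len(items)]):
        result.append([items[start], items[start + 1:end]])
    return result

events = ['undated', '2016-01-01', 'stuff happened', 'details',
          '2016-01-02', '2016-01-03', 'report']
print(split_at_date_indices(events))
```

On this small sample the undated leading entry is grouped under an empty date string, and a date that is immediately followed by another date gets an empty event list, matching the record shape shown in the output above.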

Answer 2

Try:

import re

def split_by_date(arr, patt=r'\d+\-\d+\-\d+'):
    results = []
    srch = re.compile(patt)
    rec = ['', []]
    for item in arr:
        if srch.match(item):
            if rec[0] or rec[1]:
                results.append(rec)
            rec = [item, []]
        else:
            rec[1].append(item)
    if rec[0] or rec[1]:
        results.append(rec)
    return results

Then:

normal_case = ['2016-01-01', 'stuff happened', 'details', 
               '2016-01-02', 'more stuff happened', 'details', 'report']
special_case_1 = ['blah', 'blah', 'stuff', '2016-11-11']
special_case_2 = ['blah', 'blah', '2015/01/01', 'blah', 'blah']

print(split_by_date(normal_case))
print(split_by_date(special_case_1))
print(split_by_date(special_case_2, r'\d+/\d+/\d+'))
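
If the input is large or arrives lazily, the same loop can be written as a generator. The following is only a sketch of that idea, not something from the original answer; the name `iter_by_date` is hypothetical, and the logic simply swaps `results.append` for `yield`.

```python
import re

def iter_by_date(items, patt=r'\d+-\d+-\d+'):
    """Lazily yield [date, events] records from any iterable of strings."""
    srch = re.compile(patt)
    rec = ['', []]  # leading undated entries collect under an empty date
    for item in items:
        if srch.match(item):
            # A new date closes the current record, unless it is still empty
            if rec[0] or rec[1]:
                yield rec
            rec = [item, []]
        else:
            rec[1].append(item)
    # Emit the trailing record once the input is exhausted
    if rec[0] or rec[1]:
        yield rec

special_case_2 = ['blah', 'blah', '2015/01/01', 'blah', 'blah']
# iter(...) stands in for any one-shot source such as a file or a database cursor
for date, events in iter_by_date(iter(special_case_2), r'\d+/\d+/\d+'):
    print(repr(date), events)
```

Wrapping the call in `list(...)` should give the same records as `split_by_date`, so the two are interchangeable when the whole result is needed at once.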

Splitting a Python list given the start indices
