I have seen this: Split list into sublist based on index ranges
but my problem is slightly different. I have a list:
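I need to split it into sublists by date. It is basically an event log, but because of poor database design the system concatenates the separate update messages for each event into one big list of strings. I have: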
For my example, this gives:
Now I need to split the list into separate lists based on those indices. So for my example, ideally I would get:
So the format is:
There are also some edge cases with no date string, for which the format is:
Answer 0: (score: 1)
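You don't need a two-pass grouping at all, because itertools.groupby can segment the input by date and its associated events in a single pass. Since you never compute indices and then slice the list with them, this also works on a generator that supplies one value at a time, which avoids memory problems if your input is large. To demonstrate, I've taken the original List and extended it so that the edge-case handling is shown properly: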
import re
from itertools import groupby

List = ['undated', 'garbage', 'then', 'twodates', '2015-12-31',
        '2016-01-01', 'stuff happened', 'details',
        '2016-01-02', 'more stuff happened', 'details', 'report',
        '2016-01-03']

datere = re.compile(r"\d+\-\d+\-\d+")  # Precompile regex for speed

def group_by_date(it):
    # Make iterator that groups dates with dates and non-dates with non-dates
    grouped = groupby(it, key=lambda x: datere.match(x) is not None)
    for isdate, g in grouped:
        if not isdate:
            # We had a leading set of undated events, output as undated
            yield ['', list(g)]
        else:
            # At least one date found; iterate with one loop delay
            # so final date can have events included (all others have no events)
            lastdate = next(g)
            for date in g:
                yield [lastdate, []]
                lastdate = date

            # Final date pulls next group (which must be events or the end of the input)
            try:
                # Get next group of events
                events = list(next(grouped)[1])
            except StopIteration:
                # There were no events for final date
                yield [lastdate, []]
            else:
                # There were events associated with final date
                yield [lastdate, events]

print(list(group_by_date(List)))
Output (line breaks added for readability):
[['', ['undated', 'garbage', 'then', 'twodates']],
['2015-12-31', []],
['2016-01-01', ['stuff happened', 'details']],
['2016-01-02', ['more stuff happened', 'details', 'report']],
['2016-01-03', []]]
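To see why one pass is enough, it may help to look at what groupby produces on its own with a boolean "is this a date?" key: alternating runs of dates and non-dates, pulled lazily from the source. The following is only a minimal sketch; the names date_re, is_date and lazy_log are made up for illustration, and the sample data is hypothetical.

```python
import re
from itertools import groupby

date_re = re.compile(r"\d+-\d+-\d+")

def is_date(s):
    # True when the string starts with something shaped like 2016-01-01
    return date_re.match(s) is not None

def lazy_log():
    # Hypothetical generator standing in for a large log read one item at a time
    yield from ['intro', '2016-01-01', '2016-01-02',
                'event a', 'event b', '2016-01-03']

# groupby only merges *consecutive* items that share a key,
# so each run is either all dates or all non-dates
runs = [(flag, list(group)) for flag, group in groupby(lazy_log(), key=is_date)]
print(runs)
# [(False, ['intro']), (True, ['2016-01-01', '2016-01-02']),
#  (False, ['event a', 'event b']), (True, ['2016-01-03'])]
```

Every date in a run except the last has no events of its own, and the run of non-dates that follows (if any) belongs to the last date, which is the pairing that group_by_date performs.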
Answer 1: (score: 1)
Try:
import re

def split_by_date(arr, patt=r'\d+\-\d+\-\d+'):
    results = []
    srch = re.compile(patt)
    rec = ['', []]  # current record: [date, events]
    for item in arr:
        if srch.match(item):
            # A new date closes the current record, unless it is still empty
            if rec[0] or rec[1]:
                results.append(rec)
            rec = [item, []]
        else:
            rec[1].append(item)
    # Flush the last record
    if rec[0] or rec[1]:
        results.append(rec)
    return results
Then:
normal_case = ['2016-01-01', 'stuff happened', 'details',
               '2016-01-02', 'more stuff happened', 'details', 'report']
special_case_1 = ['blah', 'blah', 'stuff', '2016-11-11']
special_case_2 = ['blah', 'blah', '2015/01/01', 'blah', 'blah']

print(split_by_date(normal_case))
print(split_by_date(special_case_1))
print(split_by_date(special_case_2, r'\d+\/\d+\/\d+'))
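If the input is large, the same loop could also be written as a generator, so records are handed out one at a time instead of being collected in a results list. This is only a sketch of that variation, not part of the answer above; iter_split_by_date is a hypothetical name, and the two calls reuse the special-case data to show what comes back.

```python
import re

def iter_split_by_date(items, patt=r'\d+-\d+-\d+'):
    """Yield [date, events] records one at a time (hypothetical helper)."""
    is_date = re.compile(patt).match
    rec = ['', []]  # current record: [date, events]
    for item in items:
        if is_date(item):
            # A new date closes the record in progress, unless it is still empty
            if rec[0] or rec[1]:
                yield rec
            rec = [item, []]
        else:
            rec[1].append(item)
    # Emit whatever is left once the input ends
    if rec[0] or rec[1]:
        yield rec

print(list(iter_split_by_date(['blah', 'blah', 'stuff', '2016-11-11'])))
# [['', ['blah', 'blah', 'stuff']], ['2016-11-11', []]]
print(list(iter_split_by_date(['blah', 'blah', '2015/01/01', 'blah', 'blah'],
                              r'\d+/\d+/\d+')))
# [['', ['blah', 'blah']], ['2015/01/01', ['blah', 'blah']]]
```

Leading undated items end up under an empty date string, and a date with nothing after it gets an empty event list.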