What is the best way to split a string containing numbered, Roman-numeral, and bulleted lists?

Time: 2018-09-28 03:27:04

Tags: python regex split

I am trying to break up a string that contains several lists in different formats. What is the best way to do this?

string = "something here: 1) A i) great ii) awesome 2) B"

another_string = "But sometimes it is different (1) yep (2) not the same i. or this ii. another bullet (3.1) getting difficult huh? 3.1.1 okay i'm done"

Ideally, I would like to split on every possible kind of numbered or bulleted list marker.

Desired output for string:

something here: 1) A 
i) great 
ii) awesome 
2) B

Desired output for another_string:

But sometimes it is different (1) yep
(2) not the same
i. or this 
ii. another bullet
(3.1) getting difficult huh?
3.1.1 okay i'm done

1 Answer:

Answer 0: (score: 1)

You can use re.split with the following regex (the Roman numeral part is borrowed from paxdiablo) to split the input string, then rejoin the pieces using an iterator:

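import re
def split(s):
    # The capturing group makes re.split keep each marker: [head, marker, text, marker, text, ...]
    i = iter(re.split(r'(\(?\d+(?:\.\d+)+\)?|\(?\d+\)|\(?\b(?=M|(?:CM|CD|D?C)|(?:XC|XL|L?X)|(?:IX|IV|V?I))M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})[.)])', s, flags=re.IGNORECASE))
    # Leading text first, then each marker joined to the text after it, one list item per line
    return next(i) + '\n'.join(map(''.join, zip(i, i)))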
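
To see why the final line of split() works, here is a minimal sketch using a deliberately simplified pattern that only recognizes markers like 1) and 2). The pattern and the sample text are illustrative assumptions, not the full regex above. Because the pattern sits inside a capturing group, re.split keeps the markers in its result, and zip(i, i) draws two consecutive items from the same iterator, which pairs each marker with the text that follows it.

```python
import re

# Simplified, illustrative pattern: only matches markers such as "1)" or "2)".
parts = re.split(r'(\d+\))', "intro 1) A 2) B")
# The capturing group keeps the markers in the list:
# parts == ['intro ', '1)', ' A ', '2)', ' B']

i = iter(parts)
head = next(i)            # text before the first marker
pairs = list(zip(i, i))   # same iterator twice -> (marker, text) pairs
# pairs == [('1)', ' A '), ('2)', ' B')]

# Glue each marker to its text and put one list item per line.
result = head + '\n'.join(map(''.join, pairs))
print(result)
```

If the input contains no marker at all, re.split returns a single-element list, zip yields nothing, and the original string comes back unchanged.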

With your sample input:

split(string)

returns:

something here: 1) A 
i) great 
ii) awesome 
2) B

and:

split(another_string)

returns:

But sometimes it is different (1) yep 
(2) not the same 
i. or this 
ii. another bullet 
(3.1) getting difficult huh? 
3.1.1 okay i'm done