Splitting text on Roman numerals with a regular expression in Python

Time: 2015-09-12 22:27:44

Tags: python regex

I need to split a piece of text on Roman numerals.
Here is my text:

This is the part (a) of question number one. i. This is sub part one of part (a) question one ii. This is sub part two of part (a) question one iii. This is sub part three of part (a) question one

This is actually part of a question from a question paper. I want to break it up as follows:

This is the part (a) of question number one.
This is sub part one of part (a) question one
This is sub part two of part (a) question one
This is sub part three of part (a) question one


So what I want is to split the sentence at each Roman numeral.
This is the regular expression I wrote:

import re

text = 'This is the part (a) of question number one. i. This is sub part one of part (a) question one ii. This is sub part two of part (a) question one iii. This is sub part three of part (a) question one'
# Split on: a space, one or more lowercase letters, a full stop, a space
for m in re.split(r' [a-z]+\. ', text):
    print(m)


This is what I get:

This is the part (a) of question number one.
i. This is sub part one of part (a) question one
This is sub part two of part (a) question one
This is sub part three of part (a) question one

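My expression works for Roman numerals two and three, but not for Roman numeral one. So I need a general expression that works for any Roman numeral.
Note that there is a space before each Roman numeral, and a full stop followed by a space after it.
Can someone help me solve this?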

3 Answers:

Answer 0: (score: 3)

Your regex also matches the substring " one. ", which consumes the space in front of "i.". Try changing it like this:

import re

text = 'This is the part (a) of question number one. i. This is sub part one of part (a) question one ii. This is sub part two of part (a) question one iii. This is sub part three of part (a) question one'

# Only letters that can occur in a Roman numeral are allowed between the space and the full stop
for m in re.split(r' [MDCLXVI]+\. ', text, flags=re.IGNORECASE):
    print(m)
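
One caveat with a plain character class is that it accepts any run of those letters, not only well-formed numerals. The following is a small sketch of that behaviour; the sample sentence and the name LOOSE are made up for illustration and are not part of the question.

```python
import re

# Same pattern as in the answer above: any run of numeral letters, case-insensitive.
LOOSE = re.compile(r' [MDCLXVI]+\. ', flags=re.IGNORECASE)

# Hypothetical sample: "civil" is built only from the letters c, i, v, l,
# so the pattern treats " civil. " as if it were a numeral marker.
sample = 'Keep it civil. Then continue. i. First item ii. Second item'
parts = LOOSE.split(sample)
print(parts)
# ['Keep it', 'Then continue.', 'First item', 'Second item']
```

For text like the question's, where ordinary words before a full stop rarely consist of only these letters, this may never matter, but it is worth keeping in mind for longer documents.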

Answer 1: (score: 0)

That is not what I get. Check your first line again. I get

This is the part (a) of question number

because your regex matches " one. ".

re.split(r'i+\. ',text)

works for me.
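
To see which substrings the question's pattern actually consumes, a quick check with re.findall can help. This is only a diagnostic sketch using the text and pattern from the question:

```python
import re

text = ('This is the part (a) of question number one. '
        'i. This is sub part one of part (a) question one '
        'ii. This is sub part two of part (a) question one '
        'iii. This is sub part three of part (a) question one')

# List every separator the original pattern finds.
matches = re.findall(r' [a-z]+\. ', text)
print(matches)
# [' one. ', ' ii. ', ' iii. ']
```

The match " one. " uses up the space that would otherwise precede "i.", so "i. " is never seen as a separator.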

Answer 2: (score: 0)

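If you want proper Roman numerals (lowercase Roman numerals are often called 'romanette'), they are easy to generate. Mark Pilgrim has various Roman numeral utilities in the book Dive Into Python, some of which can be seen here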

The one that generates the numerals:

class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass

def toRoman(n):
    """convert integer to Roman numeral"""
    if not (0 < n < 5000):
        raise OutOfRangeError("number out of range (must be 1..4999)")
    if int(n) != n:
        raise NotIntegerError("decimals can not be converted")
    # Numeral/value pairs in descending order, including the subtractive forms
    romanNumeralMap = (('M',  1000), ('CM', 900), ('D',  500), ('CD', 400), ('C',  100), ('XC', 90),
       ('L',  50), ('XL', 40), ('X',  10), ('IX', 9), ('V',  5), ('IV', 4), ('I',  1))
    result = ""
    for numeral, integer in romanNumeralMap:
        # Append the numeral for as long as its value still fits into n
        while n >= integer:
            result += numeral
            n -= integer
    return result

Testing:

>>> [toRoman(x) for x in range(1,21)]
['I', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII', 'XVIII', 'XIX', 'XX']

This can be used to generate a pattern for the Roman numerals up to 20 and put it into a regex:

>>> pat = ' (?:' + '|'.join([toRoman(i).lower() for i in range(1, 21)]) + r')\. '
>>> pat
' (?:i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xiii|xiv|xv|xvi|xvii|xviii|xix|xx)\\. '

Then you can split your text (held in txt here):

>>> print('\n'.join(re.split(pat, txt)))
This is the part (a) of question number one.
This is sub part one of part (a) question one
This is sub part two of part (a) question one
This is sub part three of part (a) question one
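
The same idea can be wrapped up as a self-contained Python 3 sketch. The names to_romanette and marker_pattern are hypothetical helpers introduced only for this example, and the converter is a compact stand-in for toRoman without the range checks.

```python
import re

# Value/symbol pairs in descending order, lowercase, including subtractive forms.
_PAIRS = [(1000, 'm'), (900, 'cm'), (500, 'd'), (400, 'cd'), (100, 'c'), (90, 'xc'),
          (50, 'l'), (40, 'xl'), (10, 'x'), (9, 'ix'), (5, 'v'), (4, 'iv'), (1, 'i')]

def to_romanette(n):
    """Return the lowercase Roman numeral for a positive integer n (no range checks)."""
    out = []
    for value, symbol in _PAIRS:
        while n >= value:
            out.append(symbol)
            n -= value
    return ''.join(out)

def marker_pattern(limit):
    """Compile ' (?:i|ii|...)\\. ' covering the numerals 1..limit."""
    numerals = [to_romanette(i) for i in range(1, limit + 1)]
    return re.compile(r' (?:' + '|'.join(numerals) + r')\. ')

text = ('This is the part (a) of question number one. '
        'i. This is sub part one of part (a) question one '
        'ii. This is sub part two of part (a) question one '
        'iii. This is sub part three of part (a) question one')

parts = marker_pattern(20).split(text)
print('\n'.join(parts))
```

Because every alternative must be followed by a full stop and a space, the order of the alternatives does not matter here: "i" fails on "ii. " and the engine moves on to "ii".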

Alternatively, you can use Mark Pilgrim's regex with re.split:

>>> pat=re.compile('''\
... [ ]                 # one space
... m{0,4}              # thousands - 0 to 4 M's
... (?:cm|cd|d?c{0,3})  # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
...                     #            or 500-800 (D, followed by 0 to 3 C's)
... (?:xc|xl|l?x{0,3})  # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
...                     #        or 50-80 (L, followed by 0 to 3 X's)
... (?:ix|iv|v?i{0,3})  # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
...                     #        or 5-8 (V, followed by 0 to 3 I's)
... [.][ ]                # full stop then a space''', re.X)
>>> print('\n'.join(pat.split(txt)))
This is the part (a) of question number one.
This is sub part one of part (a) question one
This is sub part two of part (a) question one
This is sub part three of part (a) question one
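
Every group in that pattern can match the empty string, so on its own it would also accept " . " as a separator. A possible variation, sketched here as an assumption rather than as part of the original regex, adds a lookahead that requires at least one numeral letter and uses re.IGNORECASE so uppercase numerals are accepted too. The name ROMAN_MARKER is made up for this example.

```python
import re

ROMAN_MARKER = re.compile(r"""
    [ ]                   # one space before the numeral
    (?=[mdclxvi])         # at least one numeral letter must follow
    m{0,4}                # thousands
    (?:cm|cd|d?c{0,3})    # hundreds
    (?:xc|xl|l?x{0,3})    # tens
    (?:ix|iv|v?i{0,3})    # ones
    [.][ ]                # full stop, then a space
""", re.VERBOSE | re.IGNORECASE)

text = ('This is the part (a) of question number one. '
        'i. This is sub part one of part (a) question one '
        'ii. This is sub part two of part (a) question one '
        'iii. This is sub part three of part (a) question one')

parts = ROMAN_MARKER.split(text)
print('\n'.join(parts))

# A word made of numeral letters that is not a well-formed numeral is left alone.
print(ROMAN_MARKER.split('Keep it civil. Then continue.'))
# ['Keep it civil. Then continue.']
```

Words that happen to be well-formed numerals, such as "mix" (1009) or "di" (501), would still be treated as markers, so this narrows the false positives rather than eliminating them.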