How do I construct a regular expression for this text?

Date: 2013-04-18 08:02:25

Tags: python regex

This is the input:

7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO 2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO 0 1.

This is the expected output:

[('1', '1. STR1 STR2 3. 12345 4. 0876 9. NO'),
('2', '1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO'),
('3', '1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO'),
('4', '1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO')]

I tried this:

re.findall(r'(?=\s(\d+)\s(1\..*?)\s\d+\s1\.)', txt, re.DOTALL)

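But of course this is not the right solution: the regex must split at a record header of the form (\d+) 1. and not at the nested PRub. 1 1. sequence.
What can I do to make it work?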
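
To see the problem concretely, here is a minimal sketch that runs the attempt on a shortened, made-up sample (the `sample` string is an assumption, not the real input). The nested `PRub. 1 1.` sequence looks exactly like a record header to this pattern, so record 1 is cut short and a spurious match appears:

```python
import re

# Hypothetical shortened input with the same shape as the real data.
sample = "Data 1 1. AB 2. CD PRub. 1 1. 1000 XX 2. NO 2 1. EF 3. 99 0 1."

# The pattern from the attempt above, unchanged.
attempt = r'(?=\s(\d+)\s(1\..*?)\s\d+\s1\.)'
found = re.findall(attempt, sample, re.DOTALL)

for number, body in found:
    print(number, '|', body)
# 1 | 1. AB 2. CD PRub.
# 1 | 1. 1000 XX 2. NO
# 2 | 1. EF 3. 99
```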

1 answer:

Answer 0 (score: 4):

How about this:

In [1]: s='7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO 2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO 0 1.'

In [2]: import re

In [3]: re.findall(r'(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)
Out[3]: 
['1 1. STR1 STR2 3. 12345 4. 0876 9. NO',
 '2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO',
 '3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO',
 '4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO']

For your exact output, I would do something like this:

In [4]: ns = re.findall(r'(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)

In [5]: [tuple(f.split(' ',1)) for f in ns]
Out[5]: 
[('1', '1. STR1 STR2 3. 12345 4. 0876 9. NO'),
 ('2', '1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO'),
 ('3', '1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO'),
 ('4', '1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO')]

There is probably a better way to do this, but my Python-fu is not as good as my regex-fu.
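
One possible way to skip the split step, offered as a sketch rather than a definitive answer: put the record number and the record body in two capture groups, so that re.findall returns tuples directly. The `sample` string is a shortened, made-up input, and like the pattern above this assumes single-digit record numbers.

```python
import re

# Hypothetical shortened input with the same shape as the real data.
sample = "Data 1 1. AB 2. CD PRub. 1 1. 1000 XX 2. NO 2 1. EF 3. 99 0 1."

# Same look-arounds as above, with the record number and the body
# captured separately; the space between them is matched but not captured.
pattern = r'(?<=\s)(\d)\s(.*?)(?=\s\d\s\d[.](?=$|\s[A-Z]))'

records = re.findall(pattern, sample)
print(records)
# [('1', '1. AB 2. CD PRub. 1 1. 1000 XX 2. NO'), ('2', '1. EF 3. 99')]
```

With two groups in the pattern, re.findall yields one (number, body) tuple per record, so no post-processing is needed.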

Regexplanation:

(?<=\s) # Use positive look-behind to match a leading space but don't include it
\d      # match digit    
.*?     # Match everything up till the next record (lazy)
        # The following positive look-ahead is the key. It matches the start of
        # each new record, i.e.
        # 2 1. S
        # 3 1. S
        # 4 1. Q
        # 0 1.$
        # Look-arounds match but don't consume any characters.
(?=\s\d\s\d[.](?=$|\s[A-Z]))
(?=     # positive look-ahead 1
\s      # space
\d      # digit
\s      # space
\d      # digit
[.]     # period
(?=     # positive look-ahead 2
$       # end of string
|       # OR
\s[A-Z] # space followed by uppercase letter
)       # close look-ahead 2
)       # close look-ahead 1
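
The breakdown above can also live in the code itself. Here is a minimal sketch using re.VERBOSE, which ignores whitespace inside the pattern and allows # comments. The name `RECORD` and the shortened `sample` string are assumptions made for illustration:

```python
import re

RECORD = re.compile(r"""
    (?<=\s)            # preceded by whitespace (not consumed)
    \d                 # record number (a single digit)
    .*?                # lazily take the record body
    (?=                # look-ahead 1: stop just before the next record header
        \s\d\s\d[.]    #   space, digit, space, digit, period
        (?=            #   look-ahead 2: the header must be followed by
            $          #     end of string
          |            #     OR
            \s[A-Z]    #     space and an uppercase letter
        )              #   close look-ahead 2
    )                  # close look-ahead 1
""", re.VERBOSE)

# Hypothetical shortened input with the same shape as the real data.
sample = "Data 1 1. AB 2. CD PRub. 1 1. 1000 XX 2. NO 2 1. EF 3. 99 0 1."

matches = RECORD.findall(sample)
print(matches)
# ['1 1. AB 2. CD PRub. 1 1. 1000 XX 2. NO', '2 1. EF 3. 99']
```

The nested `PRub. 1 1.` is skipped because it is followed by a digit, not by an uppercase letter or the end of the string.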