How do I split my input by this capture group?

Time: 2015-11-22 18:24:48

Tags: python regex

For this regular expression:

(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]

I would like the input string to be split at the captured \s character, i.e. the green match as seen over here.

However, when I run:

import re

p = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]')

test_str = u"Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"

re.split(p, test_str)

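it seems to split the string on the whole region matched by [.?!]+ and [A-Z0-9] (so those characters are wrongly dropped) and leaves the \s in the result.

To clarify:

Input: he paid a lot for it. Did he mind

Output received: ['he paid a lot for it','\s','id he mind']

Expected output: ['he paid a lot for it.','Did he mind']

1 answer:

Answer 0 (score: 1)

You need to remove the capturing group around (\s) and put the last character class into a lookahead so that it is excluded from the match: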

import re

p = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+\s(?=[A-Z0-9])')
#                                         ^^^^^        ^
test_str = u"Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.split(test_str))

See the IDEONE demo and the regex demo.

Any capturing group in the regex pattern adds an extra element to the list that re.split returns.
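The effect is easy to see on the short input from the question. This is a minimal sketch, assuming Python 3 (so the patterns are plain r'' strings); the names text, with_group and without_group are only illustrative:

```python
import re

text = "he paid a lot for it. Did he mind"

# Original pattern: (\s) is a capturing group, so re.split puts the captured
# whitespace into the result, and the consumed [A-Z0-9] character is lost.
with_group = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]')
print(with_group.split(text))
# ['he paid a lot for it', ' ', 'id he mind']

# Fixed pattern: no capturing group, and the lookahead consumes nothing,
# so the uppercase letter stays at the start of the next piece.
without_group = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+\s(?=[A-Z0-9])')
print(without_group.split(text))
# ['he paid a lot for it', 'Did he mind']
```

Note that the punctuation is still part of the separator in the second case, so it does not appear in either piece.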

To make the punctuation stay inside the "sentences", you can use this matching regex with re.findall instead:

import re
p = re.compile(r'\s*((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+)')
test_str = "Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.findall(test_str))

See the IDEONE demo

Result:

['Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it.', 'Did he mind?', "Adam Jones Jr. thinks he didn't.", "In any case, this isn't true...", "Well, with a probability of .9 it isn't.23 is the ish.", 'My name is!', "Why wouldn't you... this is.", 'Andrew']

The regex demo

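The regex follows the rules of your original pattern:

  • \s* - matches 0 or more whitespace characters so that they are left out of the result
  • ((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+) - a capturing group, whose text re.findall grabs and returns, with 2 alternatives:

    • (?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!] - 0 or more sequences of the following, ending with a single ., ? or !
      • (?:Mr|Dr|Ms|Jr|Sr)\. - an abbreviated title
      • \.(?!\s+[A-Z0-9]) - a dot that is not followed by one or more whitespace characters and then an uppercase letter or digit
      • [^.!?] - any character other than ., ! or ?
  • ...or

    • [^.!?]+ - one or more characters other than ., ! or ?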
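
The breakdown above can also be written into the pattern itself with the re.VERBOSE flag, which ignores whitespace in the pattern and allows # comments. This is only a sketch of the same regex laid out differently; the name sentence_re and the shortened test string are illustrative, not part of the original answer:

```python
import re

sentence_re = re.compile(r'''
    \s*                           # leading whitespace, kept out of the group
    (                             # group 1: the text re.findall returns
        (?:
            (?:Mr|Dr|Ms|Jr|Sr)\.  # an abbreviated title with its dot
          | \.(?!\s+[A-Z0-9])     # a dot NOT followed by whitespace + uppercase/digit
          | [^.!?]                # any character other than . ! ?
        )*
        [.?!]                     # the punctuation mark that ends the sentence
      |
        [^.!?]+                   # or a trailing chunk with no . ! ? at all
    )
''', re.VERBOSE)

test_str = "Did he mind? Adam Jones Jr. thinks he didn't. Well, with a probability of .9 it isn't.\nAndrew"
print(sentence_re.findall(test_str))
# ['Did he mind?', "Adam Jones Jr. thinks he didn't.", "Well, with a probability of .9 it isn't.", 'Andrew']
```

Because the pattern contains exactly one capturing group, re.findall returns the group's text for each match rather than the whole match, which is why the leading whitespace never shows up in the result.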