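Python re.split(): keep part of the delimiter with the first string and the rest with the second string

Date: 2017-09-14 19:10:30

Tags: python regex string split

I have a situation where I want to split a long string of text into sentences. I have code that splits the string where I want it to, but it strips the delimiters (which I know it is supposed to do). Now I would like to keep those delimiters as part of the output strings, redistributed appropriately.

Here is my example: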

import re

strings = ['UT Arlington 1st - Berthiaume reached on a fielding error by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)',
'Southeast Mo. State 1st - EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).',
'UT Arlington 3rd - J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

for s in strings:
        header = re.split(r'[ ][-][ ]', s)
        print(header[0])
        text = re.split(r'([a-z][.][ ][A-Z]|[)][.][ ][A-Z])', header[-1])
        print(text)

Current output:

UT Arlington 1st
['Berthiaume reached on a fielding error by ss (0-0', '). O', '. Salinas fouled out to 1b (2-1 KBB', '). Q', '. Rohrbaugh flied out to cf (2-0 BB', '). B', '. Cox fouled out to lf (2-2 KBBKF)']
Southeast Mo. State 1st
['EZELL, T. lined out to 2b (2-2 FBBKFFF', '). H', 'OLST, D. flied out to lf (0-2 FK', '). G', 'AGAN, T. struck out swinging (1-2 BKKS).']
UT Arlington 3rd
['J. Minjarez hit by pitch (0-0); RJ Williams advanced to secon', 'd. B', 'erthiaume popped up to 1b (0-2 KF', '). O', '. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

Desired output:

UT Arlington 1st
['Berthiaume reached on a fielding error by ss (0-0)', 'O. Salinas fouled out to 1b (2-1 KBB)', 'Q. Rohrbaugh flied out to cf (2-0 BB)', 'B. Cox fouled out to lf (2-2 KBBKF)']
Southeast Mo. State 1st
['EZELL, T. lined out to 2b (2-2 FBBKFFF)', 'HOLST, D. flied out to lf (0-2 FK)', 'GAGAN, T. struck out swinging (1-2 BKKS).']
UT Arlington 3rd
['J. Minjarez hit by pitch (0-0); RJ Williams advanced to second', 'Berthiaume popped up to 1b (0-2 KF)', 'O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

3 answers:

Answer 0 (score: 3)

Instead of using regular expressions, you may want to take a look at nltk:
from nltk import sent_tokenize

strings = ['UT Arlington 1st - Berthiaume reached on a fielding error by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)',
'Southeast Mo. State 1st - EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).',
'UT Arlington 3rd - J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

needle = " - "
for string in strings:
    pos = string.find(needle)
    header = string[:pos]
    text = string[pos + len(needle):]
    print(header)   
    print(sent_tokenize(text))

Which yields:

UT Arlington 1st
['Berthiaume reached on a fielding error by ss (0-0).', 'O. Salinas fouled out to 1b (2-1 KBB).', 'Q. Rohrbaugh flied out to cf (2-0 BB).', 'B. Cox fouled out to lf (2-2 KBBKF)']
Southeast Mo. State 1st
['EZELL, T. lined out to 2b (2-2 FBBKFFF).', 'HOLST, D. flied out to lf (0-2 FK).', 'GAGAN, T. struck out swinging (1-2 BKKS).']
UT Arlington 3rd
['J. Minjarez hit by pitch (0-0); RJ Williams advanced to second.', 'Berthiaume popped up to 1b (0-2 KF).', 'O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

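The header is extracted with a string method (.find()), and the remaining text is then split into sentences by sent_tokenize().

Answer 1 (score: 3)

Answer

OK, so this works for all the use cases you presented, but it is by no means perfect. The complication is the periods that appear in the middle of a sentence: they are no longer ordinary sentence terminators but stand for something else, such as an initial.

Code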
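As a small aside on the header extraction, str.partition() can do the same job as the .find() and slicing above in one step. This is a minimal sketch, not part of the answer; the sample line is shortened from the question's data and the variable names are only illustrative.

```python
# Sample line in the same "header - plays" shape as the question's data.
line = ("UT Arlington 1st - Berthiaume reached on a fielding error by ss (0-0). "
        "O. Salinas fouled out to 1b (2-1 KBB).")

# partition() splits on the first occurrence of the separator and returns
# a 3-tuple: (text before, separator, text after).
header, sep, text = line.partition(" - ")

print(header)
print(text)
```

One difference worth knowing: if the separator is missing, .find() returns -1 and the slices silently produce wrong pieces, whereas partition() returns the whole string as the header and empty strings for the other two parts.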

You can see this code in use here

\h*+(.{2,}?(?:\.|$))(?=(?:\h+[A-Z])|$)

Results

Input 1

J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.

Output 1

J. Minjarez hit by pitch (0-0); RJ Williams advanced to second.

Berthiaume popped up to 1b (0-2 KF).

O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.

Input 2

EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).

Output 2

EZELL, T. lined out to 2b (2-2 FBBKFFF).

HOLST, D. flied out to lf (0-2 FK).

GAGAN, T. struck out swinging (1-2 BKKS).

Input 3

Berthiaume reached on a fielding error by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)

Output 3

Berthiaume reached on a fielding error by ss (0-0).

O. Salinas fouled out to 1b (2-1 KBB).

Q. Rohrbaugh flied out to cf (2-0 BB).

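Explanation

The regex works as follows:

  • Match between zero and unlimited horizontal whitespace characters \h, as many times as possible, without giving back (possessive)
  • Capture between 2 and unlimited characters (any character except newline), as few as possible, followed by either a . or the end of the string $
  • Ensure that what was just matched is followed by one of the following
    • Between one and unlimited horizontal whitespace characters, followed by an uppercase letter [A-Z]
    • The end of the string $

The reason I use .{2,}? is to require a match of at least 2 characters. An initial has only 1 character before its ., so initials are not treated as sentence ends in cases such as B. Cox. The quantifier is lazy so that it stops as soon as the next token matches (a dot \. or the end of the string $).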
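To illustrate the point about the 2-character minimum, here is a minimal sketch in Python. The two patterns are simplified stand-ins for the regex above (no leading whitespace handling), and the sample sentence is made up for the demonstration.

```python
import re

sample = "B. Cox fouled out."

# Minimum of 1 character: the lazy match stops at the initial's period,
# because that period is followed by whitespace and an uppercase letter.
one_char = re.match(r'(.+?\.)(?=\s+[A-Z]|$)', sample).group(1)

# Minimum of 2 characters: "B." is consumed by the quantifier itself,
# so the first period that can end the match is the real sentence end.
two_chars = re.match(r'(.{2,}?\.)(?=\s+[A-Z]|$)', sample).group(1)

print(repr(one_char))
print(repr(two_chars))
```

The first pattern cuts the sentence after the initial, while the second returns the whole sentence.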

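Edit

Since Python's re module does not support possessive quantifiers (they were only added later, in Python 3.11) and, according to regex101, does not seem to support \h as a horizontal whitespace character either, I have edited the regex slightly, as shown below.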

See this code in use here

\s*(\S.{1,}?(?:\.|$))(?=(?:\s+[A-Z])|$)
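As a rough check of the Python version of the pattern, the sketch below runs it with re.findall() on Input 3. The pattern is copied from the line above; the names SENTENCE, text and sentences are illustrative and not from the answer.

```python
import re

# Pattern copied verbatim from the edited answer.
SENTENCE = re.compile(r'\s*(\S.{1,}?(?:\.|$))(?=(?:\s+[A-Z])|$)')

text = ("Berthiaume reached on a fielding error by ss (0-0). "
        "O. Salinas fouled out to 1b (2-1 KBB). "
        "Q. Rohrbaugh flied out to cf (2-0 BB). "
        "B. Cox fouled out to lf (2-2 KBBKF)")

# With a single capturing group, findall() returns only the captured text,
# so the whitespace consumed by the leading \s* is not part of the result.
sentences = SENTENCE.findall(text)

for sentence in sentences:
    print(sentence)
```

The last sentence has no trailing period, so it is matched through the $ alternative rather than \. and comes back as the fourth item.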

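Answer 2 (score: 1)

Since every sentence ends with the current count of balls and strikes, you can split on the . whenever a ) comes right before the period. In addition, the regex checks whether the last letter before the period is lowercase and whether what follows is a space and then an uppercase letter (which signals the end of a regular sentence and the start of a new one):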

import re

strings = ['UT Arlington 1st - Berthiaume reached on a fielding error  by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)', 'Southeast Mo. State 1st - EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).', 'UT Arlington 3rd - J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

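# Split on a hyphen with no digit on either side (the header separator), on a
# period right after ")", or on a period between a lowercase letter and " <Uppercase>".
new_data = [re.split(r"(?<!\d)-(?!\d)|(?<=\))\.|(?<=[a-z])\.(?=\s[A-Z])", i) for i in strings]

for plays in new_data:
    print(plays)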
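A variation on the lookaround idea in this answer, offered only as a sketch: separate the header first, then split the remaining text on a period plus whitespace, with the context checked by zero-width lookarounds. The pattern BOUNDARY and the helper split_plays are hypothetical names introduced here, and the approach assumes that sentences end in ")." or a lowercase letter followed by ".", which holds for the question's data but may not hold elsewhere.

```python
import re

# Consume ". " (period plus whitespace) only when it is preceded by a lowercase
# letter or ")" and followed by an uppercase letter. Lookarounds are zero-width,
# so the surrounding characters stay with their sentences.
BOUNDARY = re.compile(r'(?<=[a-z)])\.\s+(?=[A-Z])')

def split_plays(line):
    """Return (header, sentences) for one 'header - plays' line."""
    # Split off the header first, so "Mo. State" in a header is never split.
    header, _, text = line.partition(" - ")
    return header, BOUNDARY.split(text)

line = ("Southeast Mo. State 1st - EZELL, T. lined out to 2b (2-2 FBBKFFF). "
        "HOLST, D. flied out to lf (0-2 FK). "
        "GAGAN, T. struck out swinging (1-2 BKKS).")

header, sentences = split_plays(line)
print(header)
print(sentences)
```

Initials such as "T." are left alone because the character before their period is uppercase, and the final period stays in place because no whitespace follows it.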