Question

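I have a long string of text that I want to split into sentences. I have code that splits the string where I want it to, but it removes the delimiters (which I know it will do). I would now like to keep those delimiters as part of the output strings, redistributed appropriately.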

Here is my example:

import re

strings = ['UT Arlington 1st - Berthiaume reached on a fielding error by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)',
'Southeast Mo. State 1st - EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).',
'UT Arlington 3rd - J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

for s in strings:
        header = re.split(r'[ ][-][ ]', s)
        print(header[0])
        text = re.split(r'([a-z][.][ ][A-Z]|[)][.][ ][A-Z])', header[-1])
        print(text)

Current output:

UT Arlington 1st
['Berthiaume reached on a fielding error by ss (0-0', '). O', '. Salinas fouled out to 1b (2-1 KBB', '). Q', '. Rohrbaugh flied out to cf (2-0 BB', '). B', '. Cox fouled out to lf (2-2 KBBKF)']
Southeast Mo. State 1st
['EZELL, T. lined out to 2b (2-2 FBBKFFF', '). H', 'OLST, D. flied out to lf (0-2 FK', '). G', 'AGAN, T. struck out swinging (1-2 BKKS).']
UT Arlington 3rd
['J. Minjarez hit by pitch (0-0); RJ Williams advanced to secon', 'd. B', 'erthiaume popped up to 1b (0-2 KF', '). O', '. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

Desired output:

UT Arlington 1st
['Berthiaume reached on a fielding error by ss (0-0)', 'O. Salinas fouled out to 1b (2-1 KBB)', 'Q. Rohrbaugh flied out to cf (2-0 BB)', 'B. Cox fouled out to lf (2-2 KBBKF)']
Southeast Mo. State 1st
['EZELL, T. lined out to 2b (2-2 FBBKFFF)', 'HOLST, D. flied out to lf (0-2 FK)', 'GAGAN, T. struck out swinging (1-2 BKKS).']
UT Arlington 3rd
['J. Minjarez hit by pitch (0-0); RJ Williams advanced to second', 'Berthiaume popped up to 1b (0-2 KF)', 'O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

Answer 1

Instead of using a regular expression, you might want to have a look at nltk:

from nltk import sent_tokenize

strings = ['UT Arlington 1st - Berthiaume reached on a fielding error by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)',
'Southeast Mo. State 1st - EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).',
'UT Arlington 3rd - J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

needle = " - "
for string in strings:
    pos = string.find(needle)
    header = string[:pos]
    text = string[pos + len(needle):]
    print(header)   
    print(sent_tokenize(text))

Which yields:

UT Arlington 1st
['Berthiaume reached on a fielding error by ss (0-0).', 'O. Salinas fouled out to 1b (2-1 KBB).', 'Q. Rohrbaugh flied out to cf (2-0 BB).', 'B. Cox fouled out to lf (2-2 KBBKF)']
Southeast Mo. State 1st
['EZELL, T. lined out to 2b (2-2 FBBKFFF).', 'HOLST, D. flied out to lf (0-2 FK).', 'GAGAN, T. struck out swinging (1-2 BKKS).']
UT Arlington 3rd
['J. Minjarez hit by pitch (0-0); RJ Williams advanced to second.', 'Berthiaume popped up to 1b (0-2 KF).', 'O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

The header is extracted with a string method (.find()), and the remaining text is then split into sentences with sent_tokenize().
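One caveat with .find(): it returns -1 when the needle is missing, in which case string[:pos] would silently drop the last character of the line. A minimal sketch of a more defensive header split using str.partition follows. The helper name split_header is made up for illustration and is not part of nltk.

```python
def split_header(line, needle=" - "):
    """Split 'header - text' at the first needle (hypothetical helper)."""
    header, sep, text = line.partition(needle)
    if not sep:
        # Needle not found: treat the whole line as text with no header.
        return "", line
    return header, text


header, text = split_header(
    "Southeast Mo. State 1st - EZELL, T. lined out to 2b (2-2 FBBKFFF)."
)
print(header)  # Southeast Mo. State 1st
print(text)    # EZELL, T. lined out to 2b (2-2 FBBKFFF).
```

The returned text can then be passed to sent_tokenize() exactly as in the loop above.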

Answer 2

Brief

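OK, so this works for all the use cases you presented, but it is by no means perfect. The complication is that a . can appear in the middle of a sentence. There it is no longer a normal sentence terminator but stands for something else, such as an initial.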

Code

You can see this code in use here

\h*+(.{2,}?(?:\.|$))(?=(?:\h+[A-Z])|$)

Results

Input 1

J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.

Output 1

J. Minjarez hit by pitch (0-0); RJ Williams advanced to second.

Berthiaume popped up to 1b (0-2 KF).

O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.

Input 2

EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).

Output 2

EZELL, T. lined out to 2b (2-2 FBBKFFF).

HOLST, D. flied out to lf (0-2 FK).

GAGAN, T. struck out swinging (1-2 BKKS).

Input 3

Berthiaume reached on a fielding error by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)

Output 3

Berthiaume reached on a fielding error by ss (0-0).

O. Salinas fouled out to 1b (2-1 KBB).

Q. Rohrbaugh flied out to cf (2-0 BB).

B. Cox fouled out to lf (2-2 KBBKF)

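Explanation

The regex works as follows:

Match between zero and unlimited horizontal whitespace characters \h, as many times as possible, without giving back (possessive)
Capture between 2 and unlimited of any character (except newline), but as few as possible, followed by either . or the end of the string $
Ensure that what was just matched is followed by one of the following
- One to unlimited horizontal whitespace characters, followed by an uppercase letter [A-Z]
- The end of the string $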

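The reason I use .{2,}? is to require a match of at least 2 characters. An initial has only 1 character before its ., so the . of an initial such as the one in B. Cox is not treated as the end of a sentence. The quantifier is lazy so that it stops as soon as the next token matches (a dot \. or the end of the string $).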
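The effect of the two-character minimum can be seen in a small comparison. This is a sketch only: it uses Python's re module with \s standing in for \h and no possessive quantifier, and the one_char pattern is a hypothetical variant written for contrast, not something from the answer.

```python
import re

text = "B. Cox fouled out to lf (2-2 KBBKF)"

# Hypothetical variant that allows a single character before the period.
one_char = re.compile(r"\s*(.+?(?:\.|$))(?=(?:\s+[A-Z])|$)")
# At least two characters (\S plus .{1,}?) must precede the terminator.
two_chars = re.compile(r"\s*(\S.{1,}?(?:\.|$))(?=(?:\s+[A-Z])|$)")

print(one_char.findall(text))   # the initial is split off as its own "sentence"
print(two_chars.findall(text))  # the initial stays attached to the sentence
```

With the one-character variant, the lazy match stops at the period of the initial because it is followed by whitespace and an uppercase letter. Requiring two characters forces the period of the initial to be consumed as ordinary content.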

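Edit

Since python's re module doesn't support possessive quantifiers (and, according to regex101, doesn't seem to support \h as a horizontal whitespace character either), I've slightly edited the regex as follows.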

See this code in use here

\s*(\S.{1,}?(?:\.|$))(?=(?:\s+[A-Z])|$)
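As a rough illustration of how the Python-compatible pattern might be applied to the strings from the question, the sketch below strips the header first and then collects the captured sentences with re.findall. The helper name split_plays is an assumption made for this example. Note that Python 3.11 and later do accept possessive quantifiers, while \h remains unsupported by re.

```python
import re

# Python-compatible pattern from the edit above.
SENTENCE = re.compile(r"\s*(\S.{1,}?(?:\.|$))(?=(?:\s+[A-Z])|$)")


def split_plays(line):
    """Return (header, sentences); split_plays is a hypothetical helper."""
    header, _, text = line.partition(" - ")
    # findall returns the contents of the single capture group per match.
    return header, SENTENCE.findall(text)


header, plays = split_plays(
    "UT Arlington 3rd - J. Minjarez hit by pitch (0-0); RJ Williams "
    "advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas "
    "flied out to cf to right center (2-1 KBB); RJ Williams advanced to third."
)
print(header)
for play in plays:
    print(play)
```

Because each sentence is captured together with its terminating period, nothing is lost the way it is with a plain re.split on the delimiter.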

Answer 3

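Since every sentence ends with the current count of balls and strikes, you can split on a . when a ) comes right before it. In addition, the regex checks whether the last letter before the period is lowercase and whether what follows is a space and then an uppercase letter (which marks the end of a regular sentence and the start of a new one):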

import re

strings = ['UT Arlington 1st - Berthiaume reached on a fielding error  by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)', 'Southeast Mo. State 1st - EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).', 'UT Arlington 3rd - J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

# Raw string so that \d, \) and \s reach the regex engine unchanged.
new_data = [re.split(r"(?<!\d)-(?!\d)|(?<=\))\.|(?<=[a-z])\.(?=\s[A-Z])", i) for i in strings]

for plays in new_data:
    print(plays)
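A related sketch, not taken from the answer above: because lookarounds are zero-width, the period can be moved into a lookbehind so that re.split consumes only the whitespace between sentences. Each sentence then keeps its own period and the next sentence keeps its leading capital, which is one way to distribute the delimiter across both sides. The pattern and the name BOUNDARY are assumptions made for this example.

```python
import re

# Split on whitespace that follows "<lowercase letter or )>." and precedes
# an uppercase letter. Only the whitespace itself is consumed.
BOUNDARY = re.compile(r"(?<=[a-z)]\.)\s+(?=[A-Z])")

line = ("UT Arlington 1st - Berthiaume reached on a fielding error by ss (0-0). "
        "O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf "
        "(2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)")

# Assumes the " - " separator is present; otherwise the unpacking fails.
header, text = re.split(r"\s-\s", line, maxsplit=1)
sentences = BOUNDARY.split(text)
print(header)
print(sentences)
```

Initials such as O. are left alone because an uppercase letter before the period does not satisfy the lookbehind. An abbreviation ending in a lowercase letter (for example Mo. in a header) would still be split, so the header has to be removed first.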

Python re.split() keeping part of the delimiter as part of the first string and the other part as part of the second string, etc.
