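I have a situation where I want to split a long string of text into sentences. I have some code that splits the string the way I want, but it removes the delimiters (which I know it will do). Now I would like to keep those delimiters as part of the output strings (reassigned to the appropriate pieces).
Here is my example: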
import re
strings = ['UT Arlington 1st - Berthiaume reached on a fielding error by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)',
'Southeast Mo. State 1st - EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).',
'UT Arlington 3rd - J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']
for s in strings:
    header = re.split(r'[ ][-][ ]', s)
    print(header[0])
    text = re.split(r'([a-z][.][ ][A-Z]|[)][.][ ][A-Z])', header[-1])
    print(text)
Current output:
UT Arlington 1st
['Berthiaume reached on a fielding error by ss (0-0', '). O', '. Salinas fouled out to 1b (2-1 KBB', '). Q', '. Rohrbaugh flied out to cf (2-0 BB', '). B', '. Cox fouled out to lf (2-2 KBBKF)']
Southeast Mo. State 1st
['EZELL, T. lined out to 2b (2-2 FBBKFFF', '). H', 'OLST, D. flied out to lf (0-2 FK', '). G', 'AGAN, T. struck out swinging (1-2 BKKS).']
UT Arlington 3rd
['J. Minjarez hit by pitch (0-0); RJ Williams advanced to secon', 'd. B', 'erthiaume popped up to 1b (0-2 KF', '). O', '. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']
Desired output:
UT Arlington 1st
['Berthiaume reached on a fielding error by ss (0-0)', 'O. Salinas fouled out to 1b (2-1 KBB)', 'Q. Rohrbaugh flied out to cf (2-0 BB)', 'B. Cox fouled out to lf (2-2 KBBKF)']
Southeast Mo. State 1st
['EZELL, T. lined out to 2b (2-2 FBBKFFF)', 'HOLST, D. flied out to lf (0-2 FK)', 'GAGAN, T. struck out swinging (1-2 BKKS).']
UT Arlington 3rd
['J. Minjarez hit by pitch (0-0); RJ Williams advanced to second', 'Berthiaume popped up to 1b (0-2 KF)', 'O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']
Answer 0 (score: 3)
You may want to take a look at nltk:
from nltk import sent_tokenize
strings = ['UT Arlington 1st - Berthiaume reached on a fielding error by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)',
'Southeast Mo. State 1st - EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).',
'UT Arlington 3rd - J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']
needle = " - "
for string in strings:
    pos = string.find(needle)
    header = string[:pos]
    text = string[pos + len(needle):]
    print(header)
    print(sent_tokenize(text))
Which yields:
UT Arlington 1st
['Berthiaume reached on a fielding error by ss (0-0).', 'O. Salinas fouled out to 1b (2-1 KBB).', 'Q. Rohrbaugh flied out to cf (2-0 BB).', 'B. Cox fouled out to lf (2-2 KBBKF)']
Southeast Mo. State 1st
['EZELL, T. lined out to 2b (2-2 FBBKFFF).', 'HOLST, D. flied out to lf (0-2 FK).', 'GAGAN, T. struck out swinging (1-2 BKKS).']
UT Arlington 3rd
['J. Minjarez hit by pitch (0-0); RJ Williams advanced to second.', 'Berthiaume popped up to 1b (0-2 KF).', 'O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']
The header is extracted with a string method (.find()), and the remaining text is then split into sentences by sent_tokenize().
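As a small variation on the header extraction described above, here is a minimal sketch using str.partition. The helper name split_header is hypothetical and not part of the answer. The point is that .find() returns -1 when the separator is missing, so the slices string[:pos] and string[pos + len(needle):] would silently cut the line in the wrong place; partition makes that case explicit.

```python
def split_header(line, sep=" - "):
    """Return (header, text) for a play-by-play line.

    Hypothetical helper: if the separator is absent, the header is
    returned as an empty string and the whole line is treated as text.
    """
    header, found, text = line.partition(sep)
    if not found:
        # No separator in this line, so there is no header to strip.
        return "", line
    return header, text


header, text = split_header(
    "UT Arlington 1st - Berthiaume reached on a fielding error by ss (0-0)."
)
print(header)
print(text)
```

The text part can then be passed to sent_tokenize() exactly as in the loop above.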
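Answer 1 (score: 3)
OK, so this works for all the use cases you presented, but it is by no means perfect. The complication is the periods that occur in the middle of a sentence. They make things harder because they are no longer normal sentence terminators but stand for something else, such as initials.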
You can see this code in use here
\h*+(.{2,}?(?:\.|$))(?=(?:\h+[A-Z])|$)
J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.
J. Minjarez hit by pitch (0-0); RJ Williams advanced to second.
Berthiaume popped up to 1b (0-2 KF).
O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.
EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).
EZELL, T. lined out to 2b (2-2 FBBKFFF).
HOLST, D. flied out to lf (0-2 FK).
GAGAN, T. struck out swinging (1-2 BKKS).
Berthiaume reached on a fielding error by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)
Berthiaume reached on a fielding error by ss (0-0).
O. Salinas fouled out to 1b (2-1 KBB).
Q. Rohrbaugh flied out to cf (2-0 BB).
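The regex works as follows:
\h*+ matches any number of horizontal whitespace characters, possessively (without giving any back).
(.{2,}?(?:\.|$)) captures at least 2 characters, as few as possible, followed by a literal dot \. or the end of the string $.
(?=(?:\h+[A-Z])|$) is a positive lookahead requiring that what follows is either horizontal whitespace and an uppercase letter [A-Z], or the end of the string $.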
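The reason I use .{2,}? is to specify that we want to match at least 2 characters (an initial has only 1 character before the ., so in cases such as B. Cox those periods are not treated as sentence ends). It uses a lazy quantifier so that it stops as soon as the next token matches (a dot \. or the end of the string $).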
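Since Python's re module did not support possessive quantifiers before Python 3.11 (and, according to regex101, does not seem to support \h as a horizontal whitespace character either), I have slightly edited the regex as follows.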
\s*(\S.{1,}?(?:\.|$))(?=(?:\s+[A-Z])|$)
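A minimal sketch of how the Python-compatible pattern might be applied with re.findall. The names SENTENCE_RE and split_sentences are hypothetical and not part of the answer. Because the pattern has exactly one capture group, findall returns the captured sentence for each match, with the leading whitespace left out.

```python
import re

# The Python-compatible pattern from above, compiled once.
SENTENCE_RE = re.compile(r'\s*(\S.{1,}?(?:\.|$))(?=(?:\s+[A-Z])|$)')


def split_sentences(text):
    """Return the sentences captured by group 1 of SENTENCE_RE."""
    return SENTENCE_RE.findall(text)


text = ("Berthiaume reached on a fielding error by ss (0-0). "
        "O. Salinas fouled out to 1b (2-1 KBB). "
        "Q. Rohrbaugh flied out to cf (2-0 BB). "
        "B. Cox fouled out to lf (2-2 KBBKF)")

for sentence in split_sentences(text):
    print(sentence)
```

Each sentence keeps its trailing period, and the initials (O., Q., B.) stay attached to the name that follows them.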
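Answer 2 (score: 1)
Since every sentence ends with the current count of balls and strikes, you can split on a . that has a ) directly before it, and on a - that has no digit on either side (the header separator, as opposed to the hyphen inside a count). In addition, the regex checks whether the last letter before the period is lowercase and whether it is followed by whitespace and then an uppercase letter (indicating the end of a regular sentence and the start of a new one):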
import re
strings = ['UT Arlington 1st - Berthiaume reached on a fielding error by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)', 'Southeast Mo. State 1st - EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).', 'UT Arlington 3rd - J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']
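# Split on: a hyphen not between digits (the header separator), a period right after ")",
# or a period after a lowercase letter that is followed by whitespace and an uppercase letter.
new_data = [re.split(r"(?<!\d)-(?!\d)|(?<=\))\.|(?<=[a-z])\.(?=\s[A-Z])", i) for i in strings]
for plays in new_data:
    print(plays)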
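The same lookaround idea can be narrowed to the sentence boundaries only. The following is a sketch, not the answer's own code: BOUNDARY and split_plays are hypothetical names, and it assumes every line contains the " - " header separator. The header is taken off first with str.partition, and the body is then split on a period that follows a lowercase letter or a ), plus the whitespace after it, but only when an uppercase letter comes next.

```python
import re

# A sentence boundary: a period preceded by a lowercase letter or ")",
# then whitespace, with an uppercase letter coming next. The lookbehind
# and lookahead are zero-width, so only ". " itself is consumed.
BOUNDARY = re.compile(r'(?<=[a-z)])\.\s+(?=[A-Z])')


def split_plays(line, sep=" - "):
    """Return (header, sentences); assumes sep occurs in the line."""
    header, _, body = line.partition(sep)
    return header, BOUNDARY.split(body)


line = ("UT Arlington 3rd - J. Minjarez hit by pitch (0-0); RJ Williams "
        "advanced to second. Berthiaume popped up to 1b (0-2 KF). "
        "O. Salinas flied out to cf to right center (2-1 KBB); "
        "RJ Williams advanced to third.")

header, sentences = split_plays(line)
print(header)
print(sentences)
```

Initials such as J. and O. are left alone because the character before their period is uppercase. To keep the period on every sentence instead of dropping it, the \. can be moved into the lookbehind, as in (?<=[a-z)]\.)\s+(?=[A-Z]), so that only the whitespace is consumed.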