A recent project required me to split an incoming phrase (as a string) into its component sentences. For example, this string:
"Your mother was a hamster, and your father smelt of elderberries! Now go away, or I shall taunt you a second time. You know what, never mind. This entire sentence is far too silly. Wouldn't you agree? I think it is."
needs to be converted into a list consisting of the following elements:
["Your mother was a hamster, and your father smelt of elderberries",
"Now go away, or I shall taunt you a second time",
"You know what, never mind",
"This entire sentence is far too silly",
"Wouldn't you agree",
"I think it is"]
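For the purposes of this function, a "sentence" is a string terminated by !, ?, or . (a period). Note that the punctuation should be removed from the output, as shown above.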
I have a working version, but it is ugly, it leaves leading and trailing whitespace, and I can't help but think there is a better way:
from functools import reduce

def split_sentences(st):
    if type(st) is not str:
        raise TypeError("Cannot split non-strings")
    sl = st.split('.')
    sl = [s.split('?') for s in sl]
    sl = reduce(lambda x, y: x+y, sl)  # Flatten the list
    sl = [s.split('!') for s in sl]
    return reduce(lambda x, y: x+y, sl)
Answer 0 (score: 8)
Use re.split with a regular expression that matches any sentence-ending character (and any whitespace that follows it).
import re

def split_sentences(st):
    sentences = re.split(r'[.?!]\s*', st)
    # A terminator at the very end of the input leaves a trailing empty string; drop it
    if sentences[-1]:
        return sentences
    else:
        return sentences[:-1]
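As a rough sketch of how this behaves on the sample string from the question, the variant below also collapses repeated terminators such as "!!" and filters out every empty piece instead of only the last one. The `[.?!]+` pattern and the `SAMPLE` constant are assumptions added for illustration; they are not part of the answer above.

```python
import re

SAMPLE = ("Your mother was a hamster, and your father smelt of elderberries! "
          "Now go away, or I shall taunt you a second time. "
          "You know what, never mind. This entire sentence is far too silly. "
          "Wouldn't you agree? I think it is.")

def split_sentences(st):
    # Split on a run of terminators plus any whitespace that follows
    parts = re.split(r'[.?!]+\s*', st)
    # Strip each piece and drop the empty ones (e.g. after the final terminator)
    return [p.strip() for p in parts if p.strip()]

for sentence in split_sentences(SAMPLE):
    print(sentence)
```

Like the answer's version, this still splits inside decimal numbers such as "0.50", because the whitespace after the terminator is optional.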
Answer 1 (score: 1)
You can also do this without regular expressions:
result = [s.strip() for s in String.replace('!', '.').replace('?', '.').split('.') if s.strip()]
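Alternatively, you can write a loop that replaces the characters in a single pass instead of building a new string for each replace call: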
String = list(String)
for i in range(len(String)):
    if String[i] == '?' or String[i] == '!':
        String[i] = '.'
# A list has no split method, so join the characters back into a string first
String = [s.strip() for s in ''.join(String).split('.') if s.strip()]
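The same single-pass idea can be sketched with `str.translate`, which maps both terminators to a period in one call and avoids the character list. The names `TERMINATORS` and `split_sentences_no_regex` are hypothetical and used only for this illustration.

```python
# Translation table that turns '?' and '!' into '.'
TERMINATORS = str.maketrans({'?': '.', '!': '.'})

def split_sentences_no_regex(text):
    # Normalize every terminator to '.', then split on it
    pieces = text.translate(TERMINATORS).split('.')
    # Strip surrounding whitespace and drop empty pieces
    return [p.strip() for p in pieces if p.strip()]

print(split_sentences_no_regex("Wouldn't you agree? I think it is."))
```

As with the other approaches in this answer, a period inside a number or abbreviation is treated as a sentence end.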
Answer 2 (score: 1)
import re
st1 = " Another example!! Let me contribute 0.50 cents here?? \
How about pointer '.' character inside the sentence? \
Uni Mechanical Pencil Kurutoga, Blue, 0.3mm (M310121P.33). \
Maybe there could be a multipoint delimeter?.. Just maybe... "
st2 = "One word"
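def split_sentences(st):
    # Append a terminator so the last sentence is always followed by a delimiter
    st = st.strip() + '. '
    # Require a second terminator or whitespace, so "0.50" and "'.'" are not split
    sentences = re.split(r'[.?!][.?!\s]+', st)
    return sentences[:-1]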
print(split_sentences(st1))
print(split_sentences(st2))
Answer 3 (score: 0)
You can use a regular expression split to break the text at specific special characters.
import re
text = "Your mother was a hamster, and your father smelt of elderberries! Now go away, or I shall taunt you a second time. You know what, never mind. This entire sentence is far too silly. Wouldn't you agree? I think it is."
# \s+ requires whitespace after the terminator, so the final "." stays on the last sentence
print(re.compile(r'[?.!]\s+').split(text))
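Because this pattern only matches a terminator that is followed by whitespace, punctuation at the very end of the input is not removed. One possible way to handle that, sketched here under the assumption that trailing terminators can simply be trimmed before splitting, is shown below; `TERMINATOR` and `split_sentences` are illustrative names.

```python
import re

TERMINATOR = re.compile(r'[?.!]\s+')

def split_sentences(text):
    # Trim surrounding whitespace and any terminators at the very end,
    # so the last sentence does not keep its punctuation
    trimmed = text.strip().rstrip('.?!')
    return TERMINATOR.split(trimmed)

print(split_sentences("Wouldn't you agree? I think it is."))
```

Requiring whitespace after the terminator has one useful side effect: a period inside a number such as "0.50" does not start a new sentence.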