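I have a string containing a large text, and I need to split it into multiple substrings of at most N characters each (as close to N as possible; N is always larger than the longest sentence), but I must not break sentences.
For example, with N = 80 and the text: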
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel.
I want to get this list of strings:
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam."
"Nam sit amet iaculis lacus, non sagittis nulla."
"Nam blandit quam eget velit maximus, eu consectetur sapien sodales."
"Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
I would also like it to work with both English and Russian.
How can I achieve this?
Answer 0: (score: 1)
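I couldn't find a built-in function for this, so here is a start. You could make it smarter by checking the line length both before and after adding a sentence, rather than only before, when deciding where a sentence should go. Lengths include whitespace, because I split naively instead of using something like a regex.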
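def get_sentences(text, min_length):
    # Naive split on ". "; re-append the separator that split() removed.
    # rstrip() avoids a doubled period when the text already ends with ".".
    sentences = (sentence + ". "
                 for sentence in text.rstrip(". ").split(". "))
    current_line = ""
    for sentence in sentences:
        # Flush the line once it has reached min_length characters.
        if len(current_line) >= min_length:
            yield current_line
            current_line = sentence
        else:
            current_line += sentence
    # Emit whatever is left over as the final line.
    yield current_line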
It is rather long and slow, but it does work.
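For the regex route mentioned above, here is a minimal sketch. It assumes a sentence ends with '.', '!' or '?' followed by whitespace, which holds for plain English and Russian prose but will wrongly split abbreviations such as "e.g. ". The names SENTENCE_END and split_sentences are made up for this example.

```python
import re

# Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
# The lookbehind keeps the punctuation attached to the sentence.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the sentences in text, each keeping its closing punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


print(split_sentences("Hello there. How are you? Fine!"))
print(split_sentences("Привет, мир. Как дела? Хорошо!"))
```

Because the pattern only looks at punctuation and whitespace, it does not care which alphabet the words use.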
Answer 1: (score: 1)
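The steps I would take:
Create a `line` variable to store the string for the current line.
To get the sentences, `.split` on `'.'`, drop the trailing empty sentence (`""`), strip leading and trailing whitespace (`.strip`), and then add the full stop back on.
So, in Python, something like: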
para = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
lines = []
line = ''
for sentence in (s.strip() + '.' for s in para.split('.')[:-1]):
    if not line:  # first sentence => no leading space
        line = sentence
    elif len(line) + 1 + len(sentence) > 80:  # can't fit on that line => start new one
        lines.append(line)
        line = sentence
    else:  # can fit on => add a space then this sentence
        line += ' ' + sentence
if line:  # don't lose the last line
    lines.append(line)
which gives `lines` as:
[
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam.",
    "Nam sit amet iaculis lacus, non sagittis nulla.",
    "Nam blandit quam eget velit maximus, eu consectetur sapien sodales.",
    "Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
]
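If you need this for more than one value of N, the same greedy idea can be wrapped in a function. This is only a sketch: pack_sentences and max_length are names invented here, and it assumes you already have a list of sentences and that none of them is longer than max_length (as the question guarantees).

```python
def pack_sentences(sentences, max_length):
    """Greedily join sentences with single spaces into lines of at most max_length characters."""
    lines = []
    line = ""
    for sentence in sentences:
        if not line:
            # First sentence of a new line: no leading space.
            line = sentence
        elif len(line) + 1 + len(sentence) <= max_length:
            # Still fits, counting the joining space.
            line += " " + sentence
        else:
            # Does not fit: close the current line and start a new one.
            lines.append(line)
            line = sentence
    if line:
        # Keep the final, possibly short, line.
        lines.append(line)
    return lines


print(pack_sentences(["Aa bb.", "Cc.", "Dd ee ff."], 10))
```

Greedy packing fills each line as far as it can before moving on. It never reorders sentences, so a line can end up well short of N when the next sentence is long.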