Suppose I have a string:
string1 = 'Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard.'
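I'm processing a large number of articles in which a period is not always followed by a space, although sometimes it is. How can I split the text into sentences without splitting decimal numbers? TIA.
Answer 0 (score: 0)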
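This can be done with regular expressions via re.split(), assuming no declarative sentence ends in a number and is immediately followed, with no space in between, by a sentence that starts with a number (e.g. "This is my sentence 1.2 is the start of my next sentence."; here the first sentence ends with "1." and the second starts with "2").
That said, split() on its own won't do what you need. It's also worth noting that, because apostrophes are more common in text than double quotes, it's probably better to delimit the string with double quotes. As written, the apostrophe in "France's" closes the single-quoted string early, so the tail "s Pernod Ricard." is not treated as part of the string and the line is invalid syntax.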
string1 = "Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard."
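import re

# Split on every "." except one with a digit on both sides (a decimal point).
# Lookarounds are zero-width, so the neighbouring characters are not consumed.
sentences = [s for s in re.split(r'(?<!\d)\.|\.(?!\d)', string1) if s]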
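Note that re.split() drops whatever the pattern matches, so the periods do not survive the split. If the terminating periods should be kept, one possible alternative (a sketch, not part of the answer above) is to match whole sentences with re.findall() instead of splitting. The function name split_sentences and the sample text are made up for illustration, and the same assumption about number-to-number sentence boundaries applies.

```python
import re

def split_sentences(text):
    """Hypothetical helper: return sentences with their trailing period kept."""
    # A sentence is a run of non-dot characters or decimal points
    # (a dot with a digit on both sides), followed by an optional period.
    pattern = r'(?:[^.]|(?<=\d)\.(?=\d))+\.?'
    matches = (m.strip() for m in re.findall(pattern, text))
    return [m for m in matches if m]

sample = "It cost 1.8bn euros.Shares rose. Done."
sentences = split_sentences(sample)
print(sentences)
```

Because the two alternatives inside the group can never match the same character, the pattern should not backtrack badly on long articles. Abbreviations such as "U.K." are still split, since nothing here distinguishes them from sentence ends.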
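Answer 1 (score: 0)
One way to do this is to protect the dots you don't want to split on: replace them with something else first, split, and then restore the placeholder afterwards: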
import re
# replace dots that have numbers around them with "[PROTECTED_DOT]"
string1_protected = re.sub(r"(\d)\.(\d)", r"\1[PROTECTED_DOT]\2", string1)
# now split (and remove empty lines)
lines_protected = [line + "." for line in string1_protected.split(".") if line]
# now re-replace all "[PROTECTED_DOT]"s
lines = [line.replace("[PROTECTED_DOT]", ".") for line in lines_protected]
Result:
In [1]: lines
Out[1]: ['Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.',
"Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard."]