In Python, how do I split a string on periods without affecting decimal numbers?

Asked: 2019-01-25 15:26:12

Tags: python-3.x split

Suppose I have a string:

string1 = 'Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard.'

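I have to process a large number of articles in which a period is not always followed by a space, although sometimes it is. How can I split the text into sentences without splitting decimal numbers? TIA.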

2 answers:

Answer 0: (score: 0)

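This can be done with a regular expression and re.split(), assuming no declarative sentence ends in a number and is immediately followed, with no space in between, by a sentence that begins with a number (for example, "This is my sentence 1.2 is the start of my next sentence."; the first sentence ends with "1." and the second starts with "2").

That said, str.split() on its own will not do what you need. Also note that apostrophes are more common in English text than double quotation marks, so it is usually safer to delimit such strings with double quotes. As written in the question, the apostrophe in "France's" closes the single-quoted string early, so the trailing "s Pernod Ricard." is not treated as part of the string and the line is a syntax error.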

string1 = "Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard."

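import re
# split on a period only when the characters on both sides are non-digits;
# the lookarounds keep those neighbouring characters in the resulting sentences
sentences = re.split(r'(?<=[^0-9])\.(?=[^0-9])', string1)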
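
If you would rather keep the period attached to each sentence and also swallow an optional space after it, here is a minimal sketch along the same lines. The function name split_sentences and the sample text are made up for illustration, and it assumes Python 3.7+, where re.split() accepts patterns that can match an empty string. It has the same limitation as above: a sentence that ends in a number and is followed directly by one starting with a digit will not be split.

```python
import re

def split_sentences(text):
    """Split after each period that is not followed by a digit.

    Hypothetical helper: the lookbehind keeps the period with its sentence,
    and the trailing whitespace pattern consumes a space after it, if any.
    """
    parts = re.split(r'(?<=\.)(?!\d)\s*', text)
    # the split point after the final period leaves an empty string; drop it
    return [part for part in parts if part]

sample = "Debt fell to 1.8bn euros.Shares rose. Analysts agree."
print(split_sentences(sample))
```

Because the split point is zero-width when there is no space, "1.8bn" stays intact while "euros.Shares" is still separated.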

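Answer 1: (score: 0)

One way to do this is to protect the dots you do not want to split on: replace them with a placeholder first, split the text, and then swap the placeholder back afterwards: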

import re
# replace dots that have numbers around them with "[PROTECTED_DOT]"
string1_protected = re.sub(r"(\d)\.(\d)", r"\1[PROTECTED_DOT]\2", string1)  
# now split (and remove empty lines)
lines_protected = [line + "." for line in string1_protected.split(".") if line]   
# now re-replace all "[PROTECTED_DOT]"s
lines = [line.replace("[PROTECTED_DOT]", ".") for line in lines_protected]

Result:

In [1]: lines
Out[1]: ['Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.',
 "Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard."]
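
Since the question mentions a large number of articles where the period is sometimes followed by a space, the three steps above could be wrapped in a small function. This is only a sketch: the name split_protecting_decimals is hypothetical, it assumes the placeholder text never occurs in the input, and it always re-appends a period, even if the last fragment did not end with one.

```python
import re

PLACEHOLDER = "[PROTECTED_DOT]"  # assumption: never appears in the articles

def split_protecting_decimals(text):
    # 1) hide every dot that sits between two digits; lookarounds are used
    #    instead of capture groups so that "1.2.3" gets both dots protected
    protected = re.sub(r"(?<=\d)\.(?=\d)", PLACEHOLDER, text)
    # 2) split on the remaining dots and trim surrounding whitespace
    pieces = [piece.strip() for piece in protected.split(".")]
    # 3) drop empty pieces, restore the hidden dots, re-attach the period
    return [piece.replace(PLACEHOLDER, ".") + "." for piece in pieces if piece]

sample = "Debt is 1.8bn euros.Shares rose. Version 1.2.3 shipped."
print(split_protecting_decimals(sample))
```

The strip() call is the only behavioural addition over the snippet above; it keeps sentences from starting with a stray space when the source text does put one after the period.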