Question

Suppose I have a string:

string1 = 'Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard.'

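I need to process a large number of articles. In these articles a period is not always followed by a space, although sometimes it is. How can I split the text into sentences without splitting decimal numbers? TIA.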

Answer 1

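This can be done with the regular-expression function re.split(), assuming that no declarative sentence ends in a number and is immediately followed, with no space in between, by a sentence that starts with a number (for example, in "This is my sentence 1.2 is the start of my next sentence.", the first sentence ends with "1." and the second begins with "2").

That said, str.split() on its own cannot do what you need. It is also worth noting that, because apostrophes are more common in text than double quotation marks, it is probably better to delimit the string with double quotes. As written in the question, everything after the apostrophe in France's ("s Pernod Ricard.") falls outside the string literal, so the line is a syntax error.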

string1 = "Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard."

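import re
# Split on a period unless it sits between two digits (as in 1.8); lookarounds avoid consuming the neighbouring characters.
sentences = [s for s in re.split(r'(?<!\d)\.|\.(?!\d)', string1) if s]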
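
If the closing period of each sentence should be kept, and an optional space after it ignored, the same idea can be sketched with zero-width lookarounds. This is only an illustration: the helper name split_keep_periods and the sample text are made up here, and it assumes Python 3.7 or later, where re.split() can split on empty matches.

```python
import re

# Hypothetical helper, not part of the original answer.
# Split at the empty position right after a period when either
#   - the character before the period is not a digit, or
#   - the character after the period is not a digit.
# A period with digits on both sides (as in "1.8") is left alone.
_BOUNDARY = re.compile(r"(?<=\D\.)|(?<=\.)(?=\D)")

def split_keep_periods(text):
    parts = _BOUNDARY.split(text)
    # strip() removes the optional space after a period; empty pieces are dropped
    return [p.strip() for p in parts if p.strip()]

sample = "Debt fell to 1.8bn euros.Shares rose. Sales grew in 2015.Next year is unclear."
print(split_keep_periods(sample))
```

The same limitation as above applies: a sentence ending in a digit that is directly followed by a sentence starting with a digit would not be separated.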

Answer 2

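One way to do this is to protect the dots you don't want to split on: first replace them with something else, then swap the placeholder back in after splitting: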

import re
# replace dots that have numbers around them with "[PROTECTED_DOT]"
string1_protected = re.sub(r"(\d)\.(\d)", r"\1[PROTECTED_DOT]\2", string1)  
# now split (and remove empty lines)
lines_protected = [line + "." for line in string1_protected.split(".") if line]   
# now re-replace all "[PROTECTED_DOT]"s
lines = [line.replace("[PROTECTED_DOT]", ".") for line in lines_protected]

Result:

In [1]: lines
Out[1]: ['Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.',
 "Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard."]

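Since the question mentions that some periods are followed by a space, the placeholder steps could be wrapped in a small function that also strips that whitespace. This is a rough sketch rather than the answer's own code: the name split_sentences is hypothetical, the placeholder is assumed never to occur in the input, and a lookaround pattern is swapped in so that chained numbers such as 1.2.3 stay protected.

```python
import re

PLACEHOLDER = "[PROTECTED_DOT]"  # assumed not to occur in the input text

def split_sentences(text):
    # 1. Hide periods that sit between two digits, e.g. "1.8" or "1.2.3".
    protected = re.sub(r"(?<=\d)\.(?=\d)", PLACEHOLDER, text)
    # 2. Split on the remaining periods; strip() absorbs an optional
    #    space after a period.
    pieces = [p.strip() for p in protected.split(".")]
    # 3. Drop empty pieces, restore the protected periods, and re-attach
    #    the sentence-ending period (added even if the last piece had none).
    return [p.replace(PLACEHOLDER, ".") + "." for p in pieces if p]

text = "It cost 1.8bn euros.Shares rose. Version 1.2.3 shipped."
print(split_sentences(text))
```

As with the regex approach, a sentence that ends in a digit and runs straight into one that starts with a digit would be treated as a decimal number and left unsplit.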
In Python, how do I split a string after a period without affecting decimal numbers?
