Question

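I am currently working on an application that splits a long column into shorter ones. To do that I split the entire text into words, but at the moment my regex splits the numbers too.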

This is what I do:

str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
sentences = str.replace(/\.+/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");

The result is:

Array [
    "This is a long string with some numbers [125.",
    "000,55 and 140.",
    "000] and an end.",
    " This is another sentence."
]

The desired result is:

Array [
    "This is a long string with some numbers [125.000,55 and 140.000] and an end.",
    "This is another sentence."
]

How do I change my regex to achieve this? Are there problems I should watch out for? Or would it be good enough to search for ". ", "? " and "! "?

Answer 1

str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")

Output:

[ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
  'This is another sentence.' ]

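Breakdown:

([.?!]) = Capture either . or ? or !

\s* = Match 0 or more whitespace characters following the previous token ([.?!]). This accounts for the spaces that follow a punctuation mark in English writing.

(?=[A-Z]) = The previous tokens only match if the next character is within the range A-Z (capital A to capital Z). Most English sentences start with a capital letter. None of the previous regexes take this into account.

The replace operation uses:

"$1|"

We used one "capturing group" ([.?!]): we capture one of those characters and replace the match with $1 (the captured character) plus |. So if we captured ?, the replacement would be ?|.

Finally, we split on the pipe | and get our result.

So, essentially, what we are saying is this:

1) Find punctuation marks (one of . or ? or !) and capture them.

2) Punctuation marks can optionally be followed by spaces.

3) After a punctuation mark, I expect a capital letter.

Unlike the regular expressions provided earlier, this follows the usual conventions of English sentences, although it will still split after an abbreviation that is followed by a capitalized word (for example "Mr. Smith").

From there:

4) We replace the captured punctuation mark (and any whitespace after it) with the punctuation mark plus a pipe |.

5) We split on the pipes to create an array of sentences.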
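
The five steps above can be sketched as a small function. The second variant is my own assumption, not part of the answer: it splits directly with a lookbehind (available in modern JavaScript engines), so a | that already occurs in the text does not cause an extra split. Both function names are hypothetical.

```javascript
// Steps 1-5 from the answer: mark each sentence end with a pipe, then split on it.
function splitWithPipe(text) {
  return text.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|");
}

// Variant (assumption): split on the optional whitespace that sits between
// sentence-ending punctuation and a capital letter, without a sentinel character.
function splitWithLookbehind(text) {
  return text.split(/(?<=[.?!])\s*(?=[A-Z])/);
}

const sample = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
console.log(splitWithPipe(sample));
console.log(splitWithLookbehind(sample));

// A pipe inside the text only confuses the sentinel-based version.
console.log(splitWithPipe("Use a|b here. Then stop."));
console.log(splitWithLookbehind("Use a|b here. Then stop."));
```

If the input can never contain |, the two versions should behave the same; the lookbehind version just removes that precondition.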

Answer 2

str.replace(/(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm, "$1$2|").split("|")

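RegExp (see on Debuggex):

(\.+|\:|\!|\?) = The sentence can end not only with ".", "!" or "?", but also with "..." or ":"
(\"*|\'*|\)*|}*|]*) = The sentence can be surrounded by quotation marks or parentheses
(\s|\n|\r|\r\n) = After a sentence there has to be a space or an end of line
g = global
m = multiline

Remarks:

If you use (?=[A-Z]), the RegExp will not work correctly in some languages. For example, "Ü", "Č" or "Á" will not be recognised.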
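
If you do want to keep the "next character is a capital letter" check for such languages, one option is a Unicode property escape instead of [A-Z]. This is a sketch that modifies the (?=[A-Z]) regex rather than the one in this answer; the function name is hypothetical, and it assumes an engine that supports \p{...} with the u flag.

```javascript
// \p{Lu} matches any Unicode uppercase letter; it is only valid with the "u" flag.
function splitOnUnicodeCapital(text) {
  return text.replace(/([.?!])\s*(?=\p{Lu})/gu, "$1|").split("|");
}

console.log(splitOnUnicodeCapital("Das ist ein Satz. Über den Rest reden wir später. Čau!"));
```

Note that this only helps with precomposed or otherwise single-code-point capitals being recognised; abbreviations followed by a capital letter are still split.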

Answer 3

You could exploit the fact that the next sentence begins with an uppercase letter or a number.

.*?(?:\.|!|\?)(?:(?= [A-Z0-9])|$)

It splits this text

This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence. Sencenes beginning with numbers work. 10 people like that.

into the sentences:

This is a long string with some numbers [125.000,55 and 140.000] and an end.
This is another sentence.
Sencenes beginning with numbers work.
10 people like that.
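
The answer gives only the pattern, so here is one way it could be applied: a sketch that assumes String.prototype.match with the g flag and trims the leading space each later match carries. The names sentencePattern and matchSentences are hypothetical.

```javascript
// The pattern from this answer, with the g flag so match() returns every sentence.
const sentencePattern = /.*?(?:\.|!|\?)(?:(?= [A-Z0-9])|$)/g;

function matchSentences(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  const matches = text.match(sentencePattern) || [];
  // Every match after the first starts with the separating space.
  return matches.map(s => s.trim());
}

const text = "This is a long string with some numbers [125.000,55 and 140.000] and an end. " +
  "This is another sentence. Sencenes beginning with numbers work. 10 people like that.";
console.log(matchSentences(text));
```

Because the pattern requires terminating punctuation, trailing text without a final ".", "!" or "?" is not returned.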

Answer 4

Use a lookahead so that a dot is only replaced when it is followed by whitespace and a word character:

sentences = str.replace(/\.(?=\s+\w)/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");

Output:

["This is a long string with some numbers [125.000,55 and 140.000] and an end.", " This is another sentence."]

Answer 5

It is safer to use a lookahead to make sure that what follows the dot is not a digit.

var str ="This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence."

var sentences = str.replace(/\.(?!\d)/g,'.|');
console.log(sentences);

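If you want to be even safer, you could also check whether what precedes the dot is a digit. JS did not support lookbehind when this was written (it was added in ES2018), so you need to capture the previous character and use it in the replace string.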

var str ="This is another sentence.1 is a good number"

var sentences = str.replace(/\.(?!\d)|([^\d])\.(?=\d)/g,'$1.|');
console.log(sentences);

A simpler solution is to escape the dots inside numbers (for example, replace them with $$$$), do the split, and then unescape the dots.
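
A minimal sketch of that escape, split, unescape idea follows. One caveat: in a JavaScript replacement string, "$$" inserts a single "$", so a literal "$$$$" would come out as "$$". To sidestep that, the sketch uses a NUL character as the placeholder, on the assumption that it never occurs in the input. The names are hypothetical.

```javascript
// Hypothetical placeholder; assumed never to occur in the input text.
const DOT_PLACEHOLDER = "\u0000";

function splitProtectingNumbers(text) {
  return text
    // 1) Hide dots that sit between two digits.
    .replace(/(\d)\.(?=\d)/g, "$1" + DOT_PLACEHOLDER)
    // 2) Mark every remaining sentence end with a pipe ($& is the whole match).
    .replace(/[.?!]+/g, "$&|")
    // 3) Split on the pipes.
    .split("|")
    // 4) Put the hidden dots back and trim the pieces.
    .map(s => s.split(DOT_PLACEHOLDER).join(".").trim())
    // 5) Drop the empty piece left after the final punctuation mark.
    .filter(s => s.length > 0);
}

console.log(splitProtectingNumbers(
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence."
));
console.log(splitProtectingNumbers("This is another sentence.1 is a good number"));
```

Like the original code in the question, this still relies on | not appearing in the text.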

Answer 6

You forgot to put '\s' in your regexp.

Try this:

var str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
var sentences = str.replace(/\.\s+/g,'.|').replace(/\?\s/g,'?|').replace(/\!\s/g,'!|').split("|");
console.log(sentences[0]);
console.log(sentences[1]);

http://jsfiddle.net/hrRrW/

Answer 7

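I would just change the strings and put something between each sentence. You told me you have the right to change them, so it will be easier to do it that way.

\r\n

By doing this you have a fixed string to search for, and you won't need these complex regexes.

If you want to do it the harder way, I would use a regex to look for ".", "?" or "!" followed by a capital letter, like Tessi showed you.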
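
A sketch of the separator approach, assuming you control the text and put "\r\n" between sentences as suggested above. The function name and the sample text are hypothetical.

```javascript
// Assumes whoever produces the text inserts the separator between sentences.
function splitOnSeparator(text, separator = "\r\n") {
  // Plain string split: dots inside numbers are never touched.
  return text.split(separator).filter(s => s.length > 0);
}

const marked = "Some numbers [125.000,55 and 140.000] and an end.\r\nThis is another sentence.";
console.log(splitOnSeparator(marked));
```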

Answer 8

The answers from @Roger Poon and @Antonín Slejška are good.

It is better to also trim the results and filter out empty strings:

const splitBySentence = (str) => {
  return str.replace(/([.?!])(\s)*(?=[A-Z])/g, "$1|")
    .split("|")
    // Trim first so that whitespace-only pieces become empty strings...
    .map(sentence => sentence.trim())
    // ...and are then removed by the filter.
    .filter(sentence => !!sentence);
}

const content = `
The Times has identified the following reporting anomalies or methodology changes in the data for New York:

May 6: New York State added many deaths from unspecified days after reconciling data from nursing homes and other care facilities.

June 30: New York City released deaths from earlier periods but did not specify when they were from.

Aug. 6: Our database changed to record deaths by New York City residents instead of deaths that took place in New York City.

Aug. 20: New York City removed four previously reported deaths after reviewing records. The state reported four new deaths in other counties.(extracted from NY Times)
`;

console.log(splitBySentence(content));

Split string into sentences in JavaScript

8 answers: