Split a string into sentences in JavaScript

Asked: 2013-09-20 10:34:16

Tags: javascript regex

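I am currently working on an application that splits a long column into shorter ones. To do that I split the entire text into sentences, but at the moment my regex also splits numbers.

This is what I do: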

str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
sentences = str.replace(/\.+/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");

The result is:

Array [
    "This is a long string with some numbers [125.",
    "000,55 and 140.",
    "000] and an end.",
    " This is another sentence."
]

The desired result is:

Array [
    "This is a long string with some numbers [125.000,55 and 140.000] and an end.",
    "This is another sentence."
]

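How do I have to change my regex to achieve this? Are there problems I should watch out for? Or would it be good enough to search for ". ", "? " and "! "?

8 answers:

Answer 0 (score: 27)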

str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")

Output:

[ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
  'This is another sentence.' ]

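Breakdown:

([.?!]) = capture one of ., ? or !

\s* = match 0 or more whitespace characters after the previous token ([.?!]). This accounts for the spaces that follow a punctuation mark in written English.

(?=[A-Z]) = the previous tokens only match if the next character is in the range A-Z (capital A to capital Z). Most English sentences start with a capital letter. None of the previously posted regexes take this into account.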


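The replace operation uses:

"$1|"

We used one "capturing group", ([.?!]). It captures one of those characters, and we replace the match with $1 (the captured character) plus |. So if we captured ?, the replacement is ?|

Finally, we split on the pipe | and get our result.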


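So, essentially, what we are saying is this:

1) Find punctuation marks (one of ., ? or !) and capture them.

2) A punctuation mark can optionally be followed by whitespace.

3) After the punctuation mark, I expect a capital letter.

Unlike the previously posted regexes, this follows the conventions of written English more closely, although an abbreviation followed by a capitalized word (for example "Mr. Smith") is still treated as a sentence boundary.

From there:

4) We replace the captured punctuation mark with itself plus a pipe |.

5) We split on the pipe to create an array of sentences.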
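The five steps above can be put together as a small function. This is only a minimal sketch: the name splitOnCapital is made up for illustration and does not appear in the original answer, and the regex is the one shown at the top of this answer.

```javascript
// Hypothetical helper name; the regex is the one from this answer.
function splitOnCapital(text) {
  return text
    // Steps 1-3: capture ., ? or !, swallow optional whitespace,
    // and require an upper-case ASCII letter right after it.
    // Step 4: put the punctuation back and append a pipe.
    .replace(/([.?!])\s*(?=[A-Z])/g, "$1|")
    // Step 5: split on the pipe.
    .split("|");
}

const input = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
console.log(splitOnCapital(input));
```

Because the pipe is used as a temporary marker, any | already present in the text would also act as a split point, so this approach assumes the input contains no pipes.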

Answer 1 (score: 7)

str.replace(/(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm, "$1$2|").split("|")

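RegExp (see Debuggex):

  • (\.+|\:|\!|\?) = a sentence can end not only with ".", "!" or "?", but also with "..." or ":"
  • (\"*|\'*|\)*|}*|]*) = a sentence can be enclosed in quotation marks or brackets
  • (\s|\n|\r|\r\n) = a sentence must be followed by a space or an end of line
  • g = global
  • m = multiline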
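To see the three groups at work, the regex can be run on a text that contains a quoted exclamation and an ellipsis. The function name splitWithClosers and the sample sentence are illustrative assumptions; the regex is copied unchanged from this answer.

```javascript
// Regex copied from this answer; function name and sample text are illustrative.
function splitWithClosers(text) {
  return text
    // $1 = end punctuation, $2 = optional closing quotes/brackets,
    // third group = the whitespace that is replaced by the pipe.
    .replace(/(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm, "$1$2|")
    .split("|");
}

console.log(splitWithClosers('He shouted "Stop!" Then it went quiet... Nobody moved.'));
```

Since the whitespace after the punctuation is consumed by the match, the resulting pieces have no leading space. Note that a colon followed by a space also counts as a boundary here, which may or may not be what you want.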

Remarks:

  • If you use (?=[A-Z]), the RegExp will not work correctly in some languages. For example, "Ü", "Č" or "Á" will not be recognized.
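One possible way around the [A-Z] limitation, not part of the original answer, is a Unicode property escape. The sketch below assumes an engine that supports \p{Lu} together with the u flag (ES2018 or later); the function name and the sample text are made up for illustration.

```javascript
// Assumes Unicode property escapes are available (ES2018+); they require the u flag.
function splitOnUnicodeCapital(text) {
  // \p{Lu} matches any upper-case letter, not only A-Z.
  return text.replace(/([.?!])\s*(?=\p{Lu})/gu, "$1|").split("|");
}

const sample = "Das ist ein Satz. Über den Wolken ist es ruhig. Čaj je hotový.";
console.log(splitOnUnicodeCapital(sample));
// For comparison: the ASCII-only lookahead finds no boundary in this text.
console.log(sample.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|"));
```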

Answer 2 (score: 6)

You can exploit the fact that the next sentence begins with a capital letter or a digit.

.*?(?:\.|!|\?)(?:(?= [A-Z0-9])|$)

It splits this text

This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence. Sencenes beginning with numbers work. 10 people like that.

into these sentences:

This is a long string with some numbers [125.000,55 and 140.000] and an end.
This is another sentence.
Sencenes beginning with numbers work.
10 people like that.
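The answer gives only the pattern, so here is a hedged sketch of one way to apply it in JavaScript. It assumes the g flag is added so that match() returns every sentence; the names sentencePattern and matchSentences are illustrative.

```javascript
// Pattern from this answer, with the g flag added so match() returns all sentences.
const sentencePattern = /.*?(?:\.|!|\?)(?:(?= [A-Z0-9])|$)/g;

function matchSentences(text) {
  // match() returns null when nothing matches, hence the fallback to [].
  const found = text.match(sentencePattern) || [];
  // Every match after the first begins with the separating space, so trim.
  return found.map(s => s.trim());
}

const text = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence. Sencenes beginning with numbers work. 10 people like that.";
console.log(matchSentences(text));
```

Unlike the replace-and-split variants, this one collects the sentences directly, so a | in the input causes no trouble. Because . does not match line breaks, text spread over several lines would need extra handling.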

Answer 3 (score: 4)

Use a lookahead so that a dot is only replaced when it is followed by whitespace and then a word character:

sentences = str.replace(/\.(?=\s+\w)/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");

Output:

["This is a long string with some numbers [125.000,55 and 140.000] and an end.", " This is another sentence."]

Answer 4 (score: 4)

It is safer to use a lookahead to make sure that what follows the dot is not a digit.

var str ="This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence."

var sentences = str.replace(/\.(?!\d)/g,'.|');
console.log(sentences);

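If you want to be even safer, you can check that the character before the dot is also a digit. JS did not support lookbehind when this was written (it was added in ES2018), so you need to capture the preceding character and use it in the replacement string.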

var str ="This is another sentence.1 is a good number"

var sentences = str.replace(/\.(?!\d)|([^\d])\.(?=\d)/g,'$1.|');
console.log(sentences);
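On engines that do support lookbehind (ES2018 or later), the same idea can be written without capturing the preceding character. This is a sketch under that assumption, not the original author's code; the function name and the sample text are illustrative.

```javascript
// Assumes an engine with lookbehind support (ES2018+).
function splitOutsideNumbers(text) {
  return text
    // A dot ends a sentence unless it has a digit on both sides:
    // either no digit follows it, or no digit precedes it.
    .replace(/\.(?!\d)|(?<!\d)\./g, ".|")
    .split("|")
    // Trim first so whitespace-only pieces become empty, then drop them.
    .map(s => s.trim())
    .filter(s => s !== "");
}

console.log(splitOutsideNumbers("This is another sentence.1 is a good number. Pi is roughly 3.14."));
```

Like the snippets above, this only looks at dots; ? and ! would need the same treatment as in the question.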

A simpler solution is to escape the dots inside numbers (for example, replace them with $$$$), do the split, and then unescape the dots.
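The escape, split, unescape idea could look roughly like this. The sketch swaps the $$$$ marker for a NUL character, because $ has a special meaning in JavaScript replacement strings ($$ inserts a single $). The placeholder and the function name are assumptions made for illustration.

```javascript
// Hypothetical placeholder: a NUL character, assumed never to occur in the input.
const DOT_PLACEHOLDER = "\u0000";

function splitProtectingNumbers(text) {
  return text
    // 1. Escape: hide every dot that sits between two digits.
    .replace(/(\d)\.(?=\d)/g, (match, digit) => digit + DOT_PLACEHOLDER)
    // 2. Split: mark each sentence end with a pipe and split on it.
    .replace(/([.?!])\s*/g, "$1|")
    .split("|")
    .filter(s => s !== "")
    // 3. Unescape: restore the hidden dots.
    .map(s => s.split(DOT_PLACEHOLDER).join("."));
}

const question = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
console.log(splitProtectingNumbers(question));
```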

Answer 5 (score: 3)

You forgot to put '\s' in your regex.

Try this:

var str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
var sentences = str.replace(/\.\s+/g,'.|').replace(/\?\s/g,'?|').replace(/\!\s/g,'!|').split("|");
console.log(sentences[0]);
console.log(sentences[1]);

http://jsfiddle.net/hrRrW/

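Answer 6 (score: 3)

I would just change the strings and put something between each sentence. You told me you are allowed to change them, so doing it this way will be easier.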

\r\n

By doing this you have a single string to search for, and you don't need these complex regexes.
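As a minimal sketch of that approach, assuming the text has already been prepared with "\r\n" between sentences (the sample string below is made up):

```javascript
// Assumes "\r\n" was inserted between sentences when the text was prepared.
const prepared = "First sentence with 125.000 in it.\r\nSecond sentence!\r\nThird one?";

// A plain string split is enough; no regex is involved.
const parts = prepared.split("\r\n");
console.log(parts);
```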

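If you want to do it the harder way, I would use a regex that looks for ".", "?" or "!" followed by a capital letter, as Tessi suggested.

Answer 7 (score: 0)

The answers from @Roger Poon and @Antonín Slejška are good.

It is better to also trim the results and filter out empty strings: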

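const splitBySentence = (str) => {
  return str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|")
    .split("|")
    // Trim first so that whitespace-only pieces become empty strings...
    .map(sentence => sentence.trim())
    // ...and can then be filtered out.
    .filter(sentence => sentence !== "");
}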

const content = `
The Times has identified the following reporting anomalies or methodology changes in the data for New York:

May 6: New York State added many deaths from unspecified days after reconciling data from nursing homes and other care facilities.

June 30: New York City released deaths from earlier periods but did not specify when they were from.

Aug. 6: Our database changed to record deaths by New York City residents instead of deaths that took place in New York City.

Aug. 20: New York City removed four previously reported deaths after reviewing records. The state reported four new deaths in other counties.(extracted from NY Times)
`;

console.log(splitBySentence(content));