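I am currently working on an application that splits a long column of text into shorter ones. To do that I split the entire text into sentences, but at the moment my regex also splits the numbers.
This is what I do: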
str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
sentences = str.replace(/\.+/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");
The result is:
Array [
"This is a long string with some numbers [125.",
"000,55 and 140.",
"000] and an end.",
" This is another sentence."
]
The desired result would be:
Array [
"This is a long string with some numbers [125.000, 140.000] and an end.",
"This is another sentence"
]
How can I change my regex to achieve this? Are there any problems I should watch out for? Or would it be good enough to search for ". ", "? " and "! "?
Answer 0 (score: 27)
str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")
Output:
[ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
'This is another sentence.' ]
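Breakdown:
([.?!]) = Capture either . or ? or !
\s* = Match 0 or more whitespace characters following the previous token ([.?!]). This accounts for the spaces that follow a punctuation mark in English text.
(?=[A-Z]) = The previous tokens only match if the next character is in the range A-Z (capital A to capital Z). Most English sentences start with a capital letter. None of the other regexes take this into account.
The replace operation uses:
"$1|"
We used one capturing group, ([.?!]), which captures one of those characters, and we replace the match with $1 (the captured character) plus |. So if we captured ?, the replacement is ?|.
Finally, we split on the pipe | and get our result.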
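So, essentially, what we are saying is this:
1) Find punctuation marks (one of . or ? or !) and capture them.
2) The punctuation mark can optionally be followed by whitespace.
3) After the punctuation mark, I expect a capital letter.
Unlike the regular expressions provided earlier, this follows the usual conventions of English sentences more closely, although it is still a heuristic.
From there:
4) We replace each captured punctuation mark with itself followed by a pipe |.
5) We split on the pipe to create an array of sentences.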
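The five steps above can be wrapped in a small function. This is only a sketch: the name `splitIntoSentences`, the sample text `demoText`, and the trailing trim/filter cleanup are assumptions added for illustration, not part of the answer's one-liner.

```javascript
// Hypothetical helper wrapping the replace-then-split approach.
function splitIntoSentences(text) {
  // Steps 1-4: capture . ? or !, swallow optional whitespace, require a
  // following capital letter, and re-emit the punctuation plus a pipe.
  const marked = text.replace(/([.?!])\s*(?=[A-Z])/g, "$1|");
  // Step 5: split on the pipe; trimming and dropping empties is extra cleanup.
  return marked
    .split("|")
    .map(s => s.trim())
    .filter(s => s.length > 0);
}

const demoText =
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. " +
  "This is another sentence. Is it? Yes!";
console.log(splitIntoSentences(demoText));

// The capital-letter heuristic also fires after abbreviations:
console.log(splitIntoSentences("Mr. Smith arrived."));
```

The second call shows the main limitation: "Mr." is treated as a sentence of its own because a capital letter follows the dot. The approach also assumes the text contains no literal | characters.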
Answer 1 (score: 7)
str.replace(/(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm, "$1$2|").split("|")
Answer 2 (score: 6)
You can exploit the fact that the next sentence begins with an uppercase letter or a number.
.*?(?:\.|!|\?)(?:(?= [A-Z0-9])|$)
It splits this text
This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence. Sencenes beginning with numbers work. 10 people like that.
into these sentences:
This is a long string with some numbers [125.000,55 and 140.000] and an end.
This is another sentence.
Sencenes beginning with numbers work.
10 people like that.
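The answer gives the pattern without showing how to apply it. One way to use it, sketched here, is with the global flag and `String.prototype.match`. The names `sentencePattern` and `matchSentences` are made up for this example, and the `trim()` call is an addition: the lookahead does not consume the space before the next sentence, so every match after the first would otherwise start with a space.

```javascript
// The answer's pattern with the g flag so match() returns every sentence.
const sentencePattern = /.*?(?:\.|!|\?)(?:(?= [A-Z0-9])|$)/g;

// Hypothetical helper: collect all matches and strip the leading spaces.
function matchSentences(text) {
  const matches = text.match(sentencePattern) || []; // match() returns null when nothing matches
  return matches.map(s => s.trim());
}

const sampleText =
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. " +
  "This is another sentence. Sencenes beginning with numbers work. 10 people like that.";
console.log(matchSentences(sampleText));
```

Because `.` does not match line breaks, this sketch assumes the text is on a single line.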
Answer 3 (score: 4)
Use a lookahead so that a dot is only replaced when it is followed by whitespace plus a word character:
sentences = str.replace(/\.(?=\s+\w)/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");
Output:
["This is a long string with some numbers [125.000,55 and 140.000] and an end.", " This is another sentence."]
Answer 4 (score: 4)
It is safer to use a lookahead to make sure that what follows the dot is not a digit.
var str ="This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence."
var sentences = str.replace(/\.(?!\d)/g,'.|');
console.log(sentences);
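If you want to be even safer, you can check that the character before the dot is a digit as well. JavaScript historically had no lookbehind (it was added in ES2018), so without it you need to capture the previous character and use it in the replacement string.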
var str ="This is another sentence.1 is a good number"
var sentences = str.replace(/\.(?!\d)|([^\d])\.(?=\d)/g,'$1.|');
console.log(sentences);
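In engines that support ES2018 lookbehind, the same "digit on both sides" check can be written without the capture group. This is a sketch of that alternative, not the answer's code; `markSentenceDots` is a hypothetical name.

```javascript
// Hypothetical helper: mark a dot unless it has a digit on BOTH sides.
// (?<!\d)\.  matches a dot not preceded by a digit
// \.(?!\d)   matches a dot not followed by a digit
function markSentenceDots(text) {
  return text.replace(/(?<!\d)\.|\.(?!\d)/g, ".|");
}

console.log(markSentenceDots("This is another sentence.1 is a good number"));
console.log(markSentenceDots("Numbers like 125.000 stay intact. Done."));
```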
A simpler solution is to escape the dots inside numbers (for example, by replacing them with $$$$), do the split, and then unescape the dots.
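The escape, split, unescape idea could look roughly like the sketch below. Several details are assumptions: the function name `splitProtectingNumbers`, the use of a NUL character as the placeholder (chosen instead of `$$$$` because `$` has a special meaning in JavaScript replacement strings), and splitting on any `.`, `?` or `!` once the numbers are protected.

```javascript
// Hypothetical helper illustrating escape -> split -> unescape.
function splitProtectingNumbers(text) {
  const PLACEHOLDER = "\u0000"; // assumed not to occur in the input

  // 1. Escape: hide dots that sit between two digits.
  const escaped = text.replace(/(\d)\.(?=\d)/g, "$1" + PLACEHOLDER);

  // 2. Split: every remaining . ? or ! ends a sentence.
  const parts = escaped
    .replace(/([.?!])\s*/g, "$1|")
    .split("|")
    .filter(s => s.length > 0);

  // 3. Unescape: put the dots back into the numbers.
  return parts.map(s => s.split(PLACEHOLDER).join("."));
}

const numbersText =
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
console.log(splitProtectingNumbers(numbersText));
```

Like the other pipe-based snippets, this assumes the input contains no literal | characters.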
Answer 5 (score: 3)
You forgot to put '\s' in your regex.
Try this:
var str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
var sentences = str.replace(/\.\s+/g,'.|').replace(/\?\s/g,'?|').replace(/\!\s/g,'!|').split("|");
console.log(sentences[0]);
console.log(sentences[1]);
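Answer 6 (score: 3)
I would just change the strings and put something in between each sentence. You told me you have the right to change them, so it would be easier to do it that way.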
\r\n
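By doing this you have a string to search for, and you will not need these complex regexes.
If you want to do it the harder way, I would use a regex to look for "." "?" "!" followed by a capital letter, as Tessi showed you.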
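If the text really is under your control, the delimiter idea reduces to a plain `split`. A minimal sketch, assuming the sentences are stored with "\r\n" between them; the constant and variable names are invented for the example.

```javascript
// Assumed convention: sentences are joined with \r\n when the text is produced.
const SENTENCE_DELIMITER = "\r\n";

const storedText = [
  "This is a long string with some numbers [125.000,55 and 140.000] and an end.",
  "This is another sentence."
].join(SENTENCE_DELIMITER);

// Splitting on the known delimiter needs no regex and leaves the numbers alone.
const recoveredSentences = storedText.split(SENTENCE_DELIMITER);
console.log(recoveredSentences);
```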
Answer 7 (score: 0)
The answers by @Roger Poon and @AntonínSlejška are good.
It would be better to also trim the results and filter out empty strings:
const splitBySentence = (str) => {
  return str.replace(/([.?!])(\s)*(?=[A-Z])/g, "$1|")
    .split("|")
    .map(sentence => sentence.trim())   // trim first so whitespace-only pieces become empty
    .filter(sentence => !!sentence);    // then drop the empty strings
}
const content = `
The Times has identified the following reporting anomalies or methodology changes in the data for New York:
May 6: New York State added many deaths from unspecified days after reconciling data from nursing homes and other care facilities.
June 30: New York City released deaths from earlier periods but did not specify when they were from.
Aug. 6: Our database changed to record deaths by New York City residents instead of deaths that took place in New York City.
Aug. 20: New York City removed four previously reported deaths after reviewing records. The state reported four new deaths in other counties.(extracted from NY Times)
`;
console.log(splitBySentence(content));