Ignoring dates and abbreviations when splitting a string

Time: 2016-01-15 13:33:56

Tags: javascript arrays regex replace

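With this code, I am trying to split a string into sentences. It almost works: abbreviations (which always have the fixed format s.s.) are treated as literal text, so no split happens after them.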

But I need the same behavior for dates, in the formats x.x. xx.xx. x.x.xx ... (always digits!)

var content = "This is a string with numbers (123.456,78 or 100.000), dates (01.01. or 1.2. or 1.02.16) and e.g. some abbreviations in it, which shouldn't split the sentence. dates and abbreviations should be ignored for splitting the string. So in this case, there are three sentences";

var result = content.replace(/\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g, function(m, g1, g2){
    return g1 ? g1 : g2+"\r";
});
var arr = result.split("\r");

So I think \b(\w\.\w\.) should also be extended to cover numbers: one or two digits before each dot, optionally followed by a two- or four-digit year.

The result for this example should be an array with three elements.

1 Answer:

Answer 0: (score: 1)

Just add the \d+(?:\.\d+){1,2}\.? alternative to the first part of the regex:

var content = "This is a string with numbers (123.456,78 or 100.000), dates (01.01. or 1.2. or 1.02.16) and e.g. some abbreviations in it, which shouldn't split the sentence. dates and abbreviations should be ignored for splitting the string. So in this case, there are three sentences";

var result = content.replace(/\b(\w\.\w\.|\d+(?:\.\d+){1,2}\.?)|([.?!])\s+(?=[A-Za-z])/g, function(m, g1, g2){
    return g1 ? g1 : g2 + "\r";
});
var arr = result.split("\r");
document.body.innerHTML = "<pre>" + JSON.stringify(arr, null, 4) + "</pre>";
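
Outside a browser, the same replacement can be wrapped in a small function and checked on another input. This is only a minimal sketch: the name splitSentences and the sample string are hypothetical and not part of the original answer; the regex itself is copied unchanged from above.

```javascript
// Hypothetical helper wrapping the regex from the answer; the name is illustrative only.
function splitSentences(text) {
    // Group 1: abbreviations (s.s.) and dotted numbers/dates, returned unchanged.
    // Group 2: sentence-ending punctuation followed by whitespace and a letter.
    var re = /\b(\w\.\w\.|\d+(?:\.\d+){1,2}\.?)|([.?!])\s+(?=[A-Za-z])/g;
    var marked = text.replace(re, function (m, g1, g2) {
        // Keep protected tokens as they are; otherwise mark the split point with "\r".
        return g1 ? g1 : g2 + "\r";
    });
    return marked.split("\r");
}

// Made-up sample input, not taken from the question.
var sample = "Due on 1.02.16 at the latest, e.g. by mail. Numbers like 100.000 stay intact! Is that right? Yes";
var parts = splitSentences(sample);
console.log(JSON.stringify(parts, null, 4));
```

For this sample the array has four elements; the date, the abbreviation and the number do not trigger a split.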

The \d+(?:\.\d+){1,2}\.? subpattern matches:

  • \d+ - one or more digits, followed by...
  • (?:\.\d+){1,2} - 1 or 2 sequences of a dot followed by one or more digits
  • \.? - and an optional dot.
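
To see what the subpattern accepts on its own, it can be tried against a few tokens. The sketch below is an illustration under one assumption: the ^ and $ anchors are added here so that each token is tested as a whole, whereas the answer uses the subpattern unanchored inside the larger regex. The variable names are hypothetical.

```javascript
// Anchored copy of the subpattern; the anchors are an addition for whole-token testing.
var dotted = /^\d+(?:\.\d+){1,2}\.?$/;

// Tokens from the question plus a few made-up counterexamples.
var samples = ["01.01.", "1.2.", "1.02.16", "100.000", "123.456", "2016", "1.2.3.4", "e.g."];

samples.forEach(function (s) {
    // Prints each token together with whether the whole token matches.
    console.log(s + " -> " + dotted.test(s));
});
```

The first five tokens match. "2016" does not, because at least one dot-and-digits group is required; "1.2.3.4" does not, because it has more than two such groups; "e.g." does not, because it contains no digits.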