Split a string into lines and sentences, but ignore abbreviations

Time: 2016-10-06 18:38:19

Tags: javascript regex string split

I have some string content that I need to split. First, I need to split the content into lines.

This is how I do that:

str.split('\n').forEach((item) => {
    if (item) {
        // TODO: split also each line into sentences

        let data = {
            type: 'item',
            content: [{
                content: item,
                timestamp: Math.floor(Date.now() / 1000)
            }]
        };

        // Save `data` to DB
    }
});

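But now I also need to split each line into sentences. The difficulty for me is splitting it correctly. My idea is to split each line at `. ` (a dot followed by a space). However, there is also an array of abbreviations at which the line should not be split: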

const abbr = ['vs.', 'min.', 'max.']; // Just an example; there are 70 abbreviations in that array

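...and there are some more rules:

  1. Any number followed by a dot, or any single letter followed by a dot, should also be ignored as a split point: 1. 2. 30. A. b.
  2. Case should be ignored: Max. Lorem ipsum should not be split. Lorem max. ipsum should not be split either.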
Example:

    const str = 'Just some examples:\nThis example has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar.';
    

    The result should be four data objects:

    { type: 'item', content: [{ content: 'Just some examples:', timestamp: 123 }] }
    { type: 'item', content: [{ content: 'This example has min. 2 lines.', timestamp: 123 }] }
    { type: 'item', content: [{ content: 'Max. 10 lines.', timestamp: 123 }] }
    { type: 'item', content: [{ content: 'There are some words: 1. Foo and 2. bar.', timestamp: 123 }] }
    

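1 answer:

Answer 0: (score: 1)

You can first detect the abbreviations and numberings in the string and replace the dot in each of them with a placeholder string. After splitting the string at the remaining dots (which mark the ends of sentences), you can restore the original dots. Once you have the sentences, you can split each of them at newline characters, as in your original code.

The updated code allows abbreviations that contain more than one dot (as shown with p.o. and s.v.p.).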

var i, j, strRegex, regex, abbrParts;
const DOT = "_DOT_";
const abbr = ["p.o.", "s.v.p.", "vs.", "min.", "max."];

var str = 'Just some examples:\nThis example s.v.p. has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar. And also A. p.o. professional letters.';

console.log("String: " + str);

// Replace dot in abbreviations found in string
for (i = 0; i < abbr.length; i++) {
    abbrParts = abbr[i].split(".");
    // Require a word boundary so that e.g. "admin." is not mistaken for "min."
    strRegex = "(\\b" + abbrParts[0] + ")";
    for (j = 1; j < abbrParts.length - 1; j++) {
        strRegex += "(\\.)(" + abbrParts[j] + ")";
    }
    strRegex += "(\\.)(" + abbrParts[abbrParts.length - 1] + "\\W*)";
    regex = new RegExp(strRegex, "gi");
    str = str.replace(regex, function () {
        var groups = arguments;
        var result = groups[1];
        for (j = 2; j < groups.length; j += 2) {
            result += (groups[j] === "." ? DOT + groups[j+1] : "");
        }
        return result;
    });
}

// Replace dot in numbers found in string
str = str.replace(/(\W*\d+)(\.)/gi, "$1" + DOT);

// Replace dot in letter numbering found in string
str = str.replace(/(\W+[a-zA-Z])(\.)/gi, "$1" + DOT);

// Split the string at dots
var parts = str.split(".");

// Restore dots in sentences
var sentences = [];
regex = new RegExp(DOT, "gi");
for (i = 0; i < parts.length; i++) {
    if (parts[i].trim().length > 0) {
        sentences.push(parts[i].replace(regex, ".").trim() + ".");
        // Log the sentence just added; indexing by `i` would be off once an empty part is skipped
        console.log("Sentence " + sentences.length + ": " + sentences[sentences.length - 1]);
    }
}
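
For completeness, here is a minimal sketch of a different approach that also does the line splitting and builds the data objects from the question. Instead of placeholder replacement, it walks over the whitespace-separated words of each line and ends a sentence at any word that ends in a dot, unless that word is a listed abbreviation (compared case-insensitively), a number plus dot, or a single letter plus dot. The names `splitSentences` and `toItems` are made up for this illustration, and the abbreviation list is just the short example from the question. It assumes abbreviations appear as standalone words (something like `(min.` with attached punctuation would not be recognized) and that collapsing runs of whitespace into single spaces is acceptable.

```javascript
// Hypothetical example list; the real array from the question has about 70 entries.
const abbr = ['vs.', 'min.', 'max.'];

// Split one line into sentences, word by word.
function splitSentences(line, abbreviations) {
  const ignored = new Set(abbreviations.map((a) => a.toLowerCase()));
  const sentences = [];
  let current = [];

  for (const word of line.split(/\s+/).filter(Boolean)) {
    current.push(word);
    // "1.", "30.", "A." or "b." never end a sentence (rule 1 in the question)
    const isNumberOrLetter = /^(\d+|[a-z])\.$/i.test(word);
    // "Max." and "max." are treated alike (rule 2 in the question)
    const isAbbreviation = ignored.has(word.toLowerCase());

    if (word.endsWith('.') && !isNumberOrLetter && !isAbbreviation) {
      sentences.push(current.join(' '));
      current = [];
    }
  }

  // Keep trailing text that does not end in a dot, e.g. "Just some examples:"
  if (current.length > 0) {
    sentences.push(current.join(' '));
  }
  return sentences;
}

// Split into lines first, then into sentences, and wrap each sentence
// in the data object shape used in the question.
function toItems(str, abbreviations) {
  const timestamp = Math.floor(Date.now() / 1000);
  return str
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .flatMap((line) => splitSentences(line, abbreviations))
    .map((sentence) => ({
      type: 'item',
      content: [{ content: sentence, timestamp: timestamp }]
    }));
}

const example = 'Just some examples:\nThis example has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar.';
toItems(example, abbr).forEach((item) => console.log(item.content[0].content));
```

With the example string from the question this yields the four expected items. Note that, as a consequence of rule 1, a sentence that ends in a number or a single letter (such as "We need 10.") is not split off either; whether that is acceptable depends on your data.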