Regular expression to split a string into words

Time: 2012-03-08 14:23:02

Tags: javascript regex mootools

I have the following text:

    <span term="db6ff2ffe2df7b8cfc0d9542bdce27dc" class="yellowback">Lorem</span> <span term="e78f5438b48b39bcbdea61b73679449d" class="yellowback">ipsum</span> dolor sit amet,   consectetur adipiscing elit.
Ut ut mattis sapien.   Suspendisse at felis nisl.   Vestibulum nec risus leo,   in consectetur dolor.   Duis suscipit arcu quis nibh dapibus gravida.   Ut vel rhoncus neque.   Sed et dolor quis est sollicitudin vulputate.   Nam vehicula,   tortor at consectetur laoreet,   nulla erat ultrices dui,   vehicula varius odio sem sed ligula.
Vivamus porttitor odio sed ligula cursus non placerat dolor posuere.
Pellentesque vitae metus vel dolor lobortis feugiat.   Nunc faucibus commodo viverra.   Aliquam porta nisl eu turpis vulputate id laoreet odio lobortis.   Proin sit amet neque nibh,   eget tincidunt est.   Etiam accumsan erat at mauris lacinia porta.
Suspendisse auctor,   quam sit amet congue consequat,   dolor orci placerat diam,   sed ultricies diam ipsum nec tortor.   Vestibulum egestas ipsum ut leo fermentum imperdiet.   Mauris varius iaculis magna,   id luctus risus vestibulum vel.

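I want to split it into words, but if you look closely, some of the words are wrapped in tags. What I want is this: if a word is inside a tag, the tag together with its contents should be treated as a single word. I currently use the following regular expression for this: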

(<span.+>|\w+|<\/span>)

This works, but when two tags are adjacent it captures both of them and treats them as a single word, which I don't want.
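The merging most likely comes from the greedy `.+` in `<span.+>`, which keeps consuming characters up to the last `>` on the line instead of stopping at the end of the opening tag. The sketch below reproduces this on a shortened, hypothetical two-tag sample (the `class` values are made up for illustration) and compares it with a pattern that uses negated character classes to stop at the first `>` and the next `<`.

```javascript
// Hypothetical sample, shortened from the text in the question.
const sample = '<span class="a">Lorem</span> <span class="b">ipsum</span> dolor';

// Original pattern: the greedy `.+` backtracks only as far as the LAST `>`
// on the line, so both adjacent spans end up in one match.
const greedy = sample.match(/(<span.+>|\w+|<\/span>)/g);

// Bounded pattern: `[^>]*` cannot run past the end of the opening tag,
// and `[^<]+` cannot run past the start of the closing tag.
const bounded = sample.match(/<span[^>]*>[^<]+<\/span>|\w+/g);

console.log(greedy);  // two elements: the merged spans, then "dolor"
console.log(bounded); // three elements: each span separately, then "dolor"
```

Note that the bounded pattern assumes a span contains plain text only; nested tags inside a span would not match.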

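I'm not fond of using regex for this kind of thing, but it seems to be the most suitable solution, because it has to be JavaScript and I can't use third-party libraries. That said, I'm open to another approach using some kind of algorithm; it would be nice if it weren't regex.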

A satisfactory result looks like this:

["<span term=\"db6ff2ffe2df7b8cfc0d9542bdce27dc\" class=\"yellow\">Lorem</span>", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "is", "simply", "dummy", "text", "of", "the", "printing", "and", "typesetting", "industry", ".
     ", "Lorem", "Ipsum", "has", "been", "the", "industry", " ' ", "s", "standard", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "text", "ever", "since", "the", "1500s", ",
     ", "when", "an", "unknown", "printer", "took", "a", "galley", "of", "type", "and", "scrambled", "it", "to", "make", "a", "type", "specimen", "book", ".
     ", "It", "has", "survived", "not", "only", "five", "centuries", ",  ", "but", "also", "the", "leap", "into", "electronic", "typesetting", ",
     ", "remaining", "essentially", "unchanged", ".  ", "It", "was", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "in", "the", "1960s", "with", "the", "release", "of", "Letraset", "sheets", "containing", "Lorem", "Ipsum", "passages", ",  ", "and", "more", "recently", "with", "desktop", "publishing", "software", "like", "Aldus", "PageMaker", "including", "versions", "of", "Lorem", "Ipsum", ".
"]

A result that is not good looks like this:

["<span term=\"db6ff2ffe2df7b8cfc0d9542bdce27dc\" class=\"yellow\">Lorem</span> <span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "is", "simply", "dummy", "text", "of", "the", "printing", "and", "typesetting", "industry", ".
         ", "Lorem", "Ipsum", "has", "been", "the", "industry", " ' ", "s", "standard", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "text", "ever", "since", "the", "1500s", ",
         ", "when", "an", "unknown", "printer", "took", "a", "galley", "of", "type", "and", "scrambled", "it", "to", "make", "a", "type", "specimen", "book", ".
         ", "It", "has", "survived", "not", "only", "five", "centuries", ",  ", "but", "also", "the", "leap", "into", "electronic", "typesetting", ",
             ", "remaining", "essentially", "unchanged", ".  ", "It", "was", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "in", "the", "1960s", "with", "the", "release", "of", "Letraset", "sheets", "containing", "Lorem", "Ipsum", "passages", ",  ", "and", "more", "recently", "with", "desktop", "publishing", "software", "like", "Aldus", "PageMaker", "including", "versions", "of", "Lorem", "Ipsum", ".
    "]

Note how the two spans in the second example form a single array element, whereas in the first example they are two separate elements.

1 answer:

Answer 0 (score: 0)

How about:

str.split(/(<span[^>]*>[^<]+<\/span>|\w+)/)
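Because the pattern is wrapped in a capturing group, `split` returns the matched tokens interleaved with the text between them, including an empty string when the input starts with a match and whitespace-only pieces between words. A minimal sketch of how that could be cleaned up follows; the sample string is a shortened, hypothetical version of the question's text, and dropping whitespace-only pieces is an assumption based on the desired output shown in the question (which keeps punctuation but not bare spaces).

```javascript
// Hypothetical sample, shortened from the text in the question.
const str = '<span class="a">Lorem</span> <span class="b">ipsum</span> dolor sit amet, consectetur.';

// The capturing group makes split() keep the matched tokens in the result.
const pieces = str.split(/(<span[^>]*>[^<]+<\/span>|\w+)/);

// Assumption: pieces that are empty or whitespace-only are not wanted,
// while punctuation such as ", " or "." is kept as its own element.
const tokens = pieces.filter(function (piece) {
  return piece.trim() !== '';
});

console.log(tokens);
```

Each span stays a separate element here, since `[^>]*` and `[^<]+` prevent a match from running into the neighbouring tag. The pattern does assume that a span holds plain text without nested tags.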