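I have the following situation.
I have a page with a TinyMCE editor into which text can be pasted. There is an option to limit the number of characters or words that can be pasted into the editor.
I have text like this: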
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod<br />tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,<br />quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo<br />consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse<br />cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non<br />proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod<br />tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,<br />quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo<br />consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse<br />cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non<br />proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod<br />tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,<br />quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo<br />consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse<br />cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non<br />proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod<br />tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,<br />quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo<br />consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse<br />cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non<br />proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod<br />tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,<br />quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo<br />consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse<br />cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non<br />proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod<br />tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,<br />quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo<br />consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse<br />cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non<br />proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p><p>
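According to Sublime Text, it is 342 words long.
If I strip the HTML tags, Sublime says it is 368 words long, while MS Word says 379.
I am trying to find a regex that matches every word except what is inside HTML tags, so that our system uses the correct word count.
So far I have tried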
/[\w\u2019\'-]+/gim
but this also matches the characters inside the HTML tags.
I have also tried
(\s+|>)\w+
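which gets closer, but it also includes the &gt; sign that is part of HTML entities.
Keep in mind that I cannot simply replace the text inside angle brackets: this editor is used to submit scientific and medical papers, so in some cases the &lt; and &gt; symbols are used as notation.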
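The two attempts above can be checked on a shortened sample. This is only an illustrative sketch: the sample string is made up and much shorter than the real pasted text.

```javascript
// Shortened, made-up sample; the real input is the pasted paper text.
var sample = '<p>Ut enim<br />veniam &gt; quis</p>';

// First attempt: every run of word characters, wherever it occurs.
var first = sample.match(/[\w\u2019\'-]+/gim);
console.log(first);
// [ 'p', 'Ut', 'enim', 'br', 'veniam', 'gt', 'quis', 'p' ]

// Second attempt: word characters preceded by whitespace or ">".
var second = sample.match(/(\s+|>)\w+/g);
console.log(second);
// [ '>Ut', ' enim', '>veniam', ' quis' ]
```

The first pattern picks up the tag names (p, br) and the entity name (gt). The second keeps the leading delimiter in each match and would miss a word at the very start of the input, since nothing precedes it.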
Answer 0: (score: 1)
There is actually a TinyMCE plugin that counts the words of a given text.
Here is a slightly adapted version of tinymce/js/tinymce/plugins/wordcount/ that should fit your purpose.
var toPlainText = function(string) {
  var tx = string;
  var tc = ''; // cleaned text; stays empty when no words are found
  if (tx) {
    tx = tx.replace(/\.\.\./g, ' '); // convert ellipses to spaces
    tx = tx.replace(/<.[^<>]*?>/g, ' ').replace(/&nbsp;|&#160;/gi, ' '); // remove html tags and space chars
    // deal with html entities
    tx = tx.replace(/(\w+)(&#?[a-z0-9]+;)+(\w+)/i, "$1$3").replace(/&.+?;/g, ' ');
    tx = tx.replace(/[0-9.(),;:!?%#$?\x27\x22_+=\\\/\-]*/g, ''); // remove numbers and punctuation
    var wordArray = tx.match(/[\w\u2019\x27\-\u00C0-\u1FFF]+/g);
    if (wordArray) {
      tc = wordArray.join(" ");
    }
  }
  // Round-trip through a detached element so that only plain text is returned
  var div = document.createElement('div');
  div.innerHTML = tc;
  return div.textContent;
};
document.write(toPlainText("<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod<br />tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,<br />quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo<br />consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse<br />cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non<br />"));
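The function above returns the cleaned text rather than a number. A minimal sketch of turning the same cleanup steps into a count follows. The helper name countWords is hypothetical (it is not part of TinyMCE), the non-breaking-space pattern is assumed to be &nbsp;|&#160;, and the div/textContent step is left out so that the sketch also runs outside a browser.

```javascript
// Hypothetical helper (not part of TinyMCE): counts words in an HTML string
// using the same cleanup steps as toPlainText.
function countWords(html) {
  if (!html) {
    return 0;
  }
  var tx = html.replace(/\.\.\./g, ' ');                   // ellipses -> space
  tx = tx.replace(/<.[^<>]*?>/g, ' ');                     // drop HTML tags
  tx = tx.replace(/&nbsp;|&#160;/gi, ' ');                 // non-breaking spaces
  tx = tx.replace(/(\w+)(&#?[a-z0-9]+;)+(\w+)/i, '$1$3');  // entity inside a word
  tx = tx.replace(/&.+?;/g, ' ');                          // remaining entities
  tx = tx.replace(/[0-9.(),;:!?%#$?\x27\x22_+=\\\/\-]*/g, ''); // numbers, punctuation
  var words = tx.match(/[\w\u2019\x27\-\u00C0-\u1FFF]+/g);
  return words ? words.length : 0;
}

console.log(countWords('<p>Lorem ipsum dolor<br />sit amet</p>')); // 5
console.log(countWords('<p>p &lt; 0.05</p>'));                     // 1
```

Note that this approach drops entities and numbers before counting, so "p &lt; 0.05" counts as a single word. That may be one reason the total differs from what a word processor reports.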
Answer 1: (score: 0)
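I would simplify it to:
var text = "<p>Lorem ipsum</p><p>Lorem ipsum</p><p sdf>Lorem ipsum</p>";
var words = text.replace(/(<([^\s>]+)>)/ig, " ").trim().split(/\s+/).length;
console.log(words); // output: 7 (<p sdf> contains a space, so it is not stripped)
replace removes the HTML tags, trim drops the leading and trailing whitespace, and split breaks the text on whitespace with a regex (so that runs of spaces are not counted as words); the length of the result is the number of words.
Note that the regex used in replace, /(<([^\s>]+)>)/ig, only matches tags without whitespace, such as <p> and </p>, so a tag with attributes like <p sdf> is left in place. This should still give you a good approximation.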
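If tags with attributes should be stripped as well, one possible variant is sketched below. It assumes that a real tag always has a letter directly after "<" or "</", so a "<" followed by a space or a digit is left alone; text such as "a <b and c> d" would still be mistaken for a tag. The function name countWordsInHtml is hypothetical.

```javascript
// Hypothetical variant of the one-liner above: the tag pattern requires a
// letter right after "<" or "</", so a bare "<" used as notation is kept.
function countWordsInHtml(html) {
  var text = html
    .replace(/<\/?[a-z][^>]*>/gi, ' ') // tags, with or without attributes
    .trim();
  // "".split(/\s+/) yields [""], so empty input is handled separately
  return text === '' ? 0 : text.split(/\s+/).length;
}

console.log(countWordsInHtml('<p>Lorem ipsum</p><p>Lorem ipsum</p><p sdf>Lorem ipsum</p>')); // 6
console.log(countWordsInHtml('<p>a &lt; b</p>')); // 3, the entity counts as a token
```

Here an entity such as &lt; that stands on its own counts as one token, which keeps notation like "a &lt; b" at three words.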
Answer 2: (score: 0)
Answer 3: (score: 0)
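<div id="test">
  <p>foofoofoofoofoo</p>
  <h1>googoogoogoogoogoo</h1>
</div>
<script>
  // Collect the text of every child node of #test, ignoring the markup.
  var allText = ''; // start from an empty string, otherwise the result begins with "undefined"
  var divElm = document.getElementById('test');
  // Index-based loop: for...in would also visit NodeList properties such as length.
  for (var i = 0; i < divElm.childNodes.length; i++) {
    allText += divElm.childNodes[i].textContent;
  }
  alert(allText);
</script>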
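The snippet above only collects the text. A minimal sketch of counting the words in it could look like the following. The helper name countWordsInNode is hypothetical, and the stand-in object exists only so the sketch can run without a browser.

```javascript
// Hypothetical helper: counts whitespace-separated tokens in the text of a
// node. Anything exposing a textContent string works, e.g. a DOM element.
function countWordsInNode(node) {
  var text = (node.textContent || '').trim();
  return text === '' ? 0 : text.split(/\s+/).length;
}

// In a browser: countWordsInNode(document.getElementById('test'))
// Stand-in object so the sketch also runs outside a browser:
var fakeNode = { textContent: '\n  foofoofoofoofoo\n  googoogoogoogoogoo\n' };
console.log(countWordsInNode(fakeNode)); // 2
```

Two caveats: textContent decodes entities, so &lt; becomes a literal < and counts as a token when it is surrounded by spaces; and adjacent block elements with no whitespace between them, such as <p>foo</p><h1>goo</h1>, yield "foogoo", which counts as one word.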