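I need to filter a few hundred stop words out of a string. Because there are so many stop words, I don't think doing something like this is a good idea: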
sentence.replace(/\b(?:the|it is|we all|an?|by|to|you|[mh]e|she|they|we...)\b/ig, '');
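How can I create something like a hash map to store the stop words? In this map, each key would be a stop word and the value would not matter. Filtering would then check whether a word exists in the stop-word map. What data structure should I use to build such a map?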
Answer 0 (score: 1)
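Nothing beats regular expressions for this kind of job. However, they have two problems: they are hard to maintain (as you pointed out in your post), and very large ones can perform badly. I don't know how many alternatives a single regex can handle, but I would guess that 20-30 is fine in any case.
So you need some code that builds the regexes dynamically from some data structure, which can be an array or just a string. I personally prefer a string, because it is the easiest to maintain.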
// taken from http://www.ranks.nl/resources/stopwords.html
var stops = ""
+"a about above after again against all am an and any are aren't as "
+"at be because been before being below between both but by can't "
+"cannot could couldn't did didn't do does doesn't doing don't down "
+"during each few for from further had hadn't has hasn't have "
+"haven't having he he'd he'll he's her here here's hers herself "
+"him himself his how how's i i'd i'll i'm i've if in into is isn't "
+"it it's its itself let's me more most mustn't my myself no nor "
+"not of off on once only or other ought our ours ourselves out "
+"over own same shan't she she'd she'll she's should shouldn't so "
+"some such than that that's the their theirs them themselves then "
+"there there's these they they'd they'll they're they've this "
+"those through to too under until up very was wasn't we we'd we'll "
+"we're we've were weren't what what's when when's where where's "
+"which while who who's whom why why's with won't would wouldn't "
+"you you'd you'll you're you've your yours yourself yourselves "
// how many stop words to put into each regex
var reSize = 20
// build the regexes, longest stop words first so that "he's" is removed before "he"
var regexes = []
stops = stops.match(/\S+/g).sort(function(a, b) { return b.length - a.length })
for (var n = 0; n < stops.length; n += reSize)
    regexes.push(new RegExp("\\b(?:" + stops.slice(n, n + reSize).join("|") + ")\\b", "gi"));
Once you have that, the rest is obvious:
// text holds the string to filter
regexes.forEach(function(r) {
    text = text.replace(r, '')
})
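To make the steps above concrete, here is a minimal self-contained sketch. The short stop list, the `chunkSize` value (standing in for `reSize`), and the `removeStopWords` helper are assumptions for illustration only. The final whitespace cleanup is also an addition, since removing words leaves their surrounding spaces behind.

```javascript
// Hypothetical short stop list; the full list above works the same way.
var stopList = "a an the is it it's of to we you";
// Stands in for reSize; deliberately tiny so several regexes get built.
var chunkSize = 3;

// Longest words first, so "it's" is removed before "it" can match inside it.
var words = stopList.match(/\S+/g).sort(function(a, b) { return b.length - a.length });

var chunkRegexes = [];
for (var n = 0; n < words.length; n += chunkSize)
    chunkRegexes.push(new RegExp("\\b(?:" + words.slice(n, n + chunkSize).join("|") + ")\\b", "gi"));

// Hypothetical helper: apply every regex in turn, then tidy the leftover spaces.
function removeStopWords(text) {
    chunkRegexes.forEach(function(r) {
        text = text.replace(r, '');
    });
    return text.replace(/\s+/g, ' ').trim();
}

console.log(removeStopWords("It's the end of a long day")); // "end long day"
```

If a stop word could ever contain regex metacharacters (none in the list above do), it would need escaping before being joined into the pattern.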
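You will need to experiment with the reSize value to find the best balance between the length of each regex and the total number of regexes. If performance is critical, you can also run the generation step once and cache the result (i.e. the generated regexes) somewhere.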
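One way to do both, as a rough sketch: build the regexes once per chunk size, keep them in a cache, and time a few candidate sizes against a sample text. The names `regexCache`, `getStopRegexes`, and `timeReSize` are hypothetical, and `Date.now()` gives only coarse timings, so treat the numbers as a rough guide.

```javascript
// Hypothetical cache: one array of regexes per (reSize, stop list) pair.
var regexCache = {};

// Build the chunked regexes on first use, then return the cached array.
function getStopRegexes(stops, reSize) {
    var key = reSize + ":" + stops;
    if (!regexCache[key]) {
        var words = stops.match(/\S+/g).sort(function(a, b) { return b.length - a.length });
        var built = [];
        for (var n = 0; n < words.length; n += reSize)
            built.push(new RegExp("\\b(?:" + words.slice(n, n + reSize).join("|") + ")\\b", "gi"));
        regexCache[key] = built;
    }
    return regexCache[key];
}

// Rough timing: milliseconds spent filtering `text` `runs` times with one reSize.
function timeReSize(stops, text, reSize, runs) {
    var cached = getStopRegexes(stops, reSize);
    var start = Date.now();
    for (var i = 0; i < runs; i++) {
        var out = text;
        cached.forEach(function(r) { out = out.replace(r, ''); });
    }
    return Date.now() - start;
}
```

For example, calling `timeReSize(stopString, sampleText, size, 1000)` for sizes such as 5, 20, and 50, where `stopString` is the original space-separated list and `sampleText` is representative input, gives a first impression of where the balance lies.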