How do I remove all stop words from a text?

Time: 2017-02-13 11:34:34

Tags: javascript

I'm trying to use this JavaScript code:

var aStopWords = new Array ("a", "the", "blah"...);

(code to make it run, full code can be found here: https://jsfiddle.net/j2kbpdjr/)

// sText is the body of text that the keywords are being extracted from. 
// It's being separated into an array of words.

// remove stop words
for (var m = 0; m < aStopWords.length; m++) {
    sText = sText.replace(' ' + aStopWords[m] + ' ', ' ');
}

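to get the keywords from a body of text. It works quite well; however, the problem I'm having is that it seems to remove only one instance of each word in the aStopWords array.

So, if I have the following body of text: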

how are you today? Are you well?

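and I set var aStopWords = new Array("are","well"), it seems to ignore the first instance of are but still shows the second are as a keyword, whereas it completely removes well from the keywords.

If anyone could help me ignore every instance of the words in aStopWords when extracting keywords, I'd really appreciate it.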

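1 answer:

Answer 0: (score: 1)

You can do this easily as follows.

First, the text is split into keywords. Then the code iterates over all the keywords and checks whether each one is a stop word. If it is, it is ignored. If not, its occurrence count in the counter object (keywordsAndCounter in the code below) is incremented.

After that, the keywords are held in a JavaScript object of the following form: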
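The counting step described above can be sketched as a small DOM-free function. This is only a minimal sketch: the name countKeywords is hypothetical, the split pattern is extended with ?, ! and commas as an assumption, and unlike the full code further down it lower-cases every word before counting.

```javascript
// Hypothetical helper (not part of the original answer): counts non-stop words.
function countKeywords(text, stopwords) {
  // Object.create(null) avoids clashes with inherited keys such as "constructor"
  var counts = Object.create(null);
  // Assumption: also split on ?, ! and commas, in addition to whitespace . ; : "
  var words = text.split(/[\s.;:"?!,]+/);
  for (var i = 0; i < words.length; i++) {
    var word = words[i].toLowerCase();
    // skip empty strings and stop words
    if (word === "" || stopwords.indexOf(word) !== -1) {
      continue;
    }
    counts[word] = (counts[word] || 0) + 1;
  }
  return counts;
}

var counts = countKeywords("how are you today? Are you well?", ["are", "well"]);
// counts.how is 1, counts.you is 2, counts.today is 1
console.log(counts);
```

Because the words are lower-cased before the lookup, both "are" and "Are" from the question's sample text are skipped.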

{ "this": 1, "that": 2 }

Objects cannot be sorted in JavaScript, but arrays can. So the object needs to be remapped to the following structure:

[
    { "keyword": "this", "counter": 1 },
    { "keyword": "that", "counter": 2 }
]

The array can then be sorted by its counter property. With the slice() function, only the top X values are extracted from the sorted list.
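For the two-entry example object above, the remap, sort and slice steps can be sketched like this. It is only an illustration; the variable names counts, list and topOne are made up for this sketch.

```javascript
// The example object from above: keyword -> number of occurrences
var counts = { "this": 1, "that": 2 };

// Remap to an array of { keyword, counter } objects so it can be sorted
var list = Object.keys(counts).map(function (keyword) {
  return { keyword: keyword, counter: counts[keyword] };
});

// Sort descending by counter
list.sort(function (a, b) {
  return b.counter - a.counter;
});

// Keep only the top X entries (X = 1 here)
var topOne = list.slice(0, 1);
console.log(topOne); // [ { keyword: 'that', counter: 2 } ]
```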

var stopwords = ["about", "all", "alone", "also", "am", "and", "as", "at", "because", "before", "beside", "besides", "between", "but", "by", "etc", "for", "i", "of", "on", "other", "others", "so", "than", "that", "though", "to", "too", "through", "until"];
var text = document.getElementById("main").innerHTML;

var keywords = text.split(/[\s\.;:"]+/);
var keywordsAndCounter = {};
for(var i=0; i<keywords.length; i++) {
  var keyword = keywords[i];
  
  // keyword is not a stopword and not empty
  if(stopwords.indexOf(keyword.toLowerCase()) === -1 && keyword !== "") {
    if(!keywordsAndCounter[keyword]) {
      keywordsAndCounter[keyword] = 0;
    }
    keywordsAndCounter[keyword]++;
  }
}

// remap from { keyword: counter, keyword2: counter2, ... } to [{ "keyword": keyword, "counter": counter }, {...} ] to make it sortable
var result = [];
var nonStopKeywords = Object.keys(keywordsAndCounter);
for(var i=0; i<nonStopKeywords.length; i++) {
  var keyword = nonStopKeywords[i];
  result.push({ "keyword": keyword, "counter": keywordsAndCounter[keyword] });
}

// sort the values according to the number of the counter
result.sort(function(a, b) {
  return b.counter - a.counter;
});

var topFive = result.slice(0, 5);
console.log(topFive);
<div id="main">This is a test to show that it is all about being between others. I am there until 8 pm even though it will be late. Because it is "cold" outside even though it is besides me.</div>
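As for why the loop in the question removed only one instance: when the first argument to String.prototype.replace is a string, only the first match is replaced, and the match is case-sensitive, which fits the second "Are" surviving. If the goal is to strip the stop words from the text itself rather than count keywords, one possible approach is a global, case-insensitive regular expression. The sketch below assumes a hypothetical helper named removeStopWords; note that \b only recognises ASCII word characters, so it may not suit non-English text.

```javascript
// Hypothetical helper (not from the original post): removes every occurrence
// of each stop word, regardless of case.
function removeStopWords(text, stopwords) {
  if (stopwords.length === 0) {
    return text;
  }
  // Escape regex metacharacters in case a stop word contains any
  var escaped = stopwords.map(function (word) {
    return word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  });
  // \b matches at word boundaries; "g" replaces all matches, "i" ignores case
  var pattern = new RegExp("\\b(?:" + escaped.join("|") + ")\\b", "gi");
  // Remove the matches, then collapse the whitespace they leave behind
  return text.replace(pattern, "").replace(/\s+/g, " ").trim();
}

var cleaned = removeStopWords("how are you today? Are you well?", ["are", "well"]);
console.log(cleaned); // "how you today? you ?"
```

Punctuation is left in place here, so a separate step would still be needed before splitting the result into keywords.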