Regex to cut out the words nearest a keyword

Date: 2018-08-04 18:26:09

Tags: javascript regex extract keyword

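I am trying to build a keyword extractor in JavaScript that also captures some context. There are many steps, and most are straightforward, except for including the unimportant words next to a keyword in a paragraph. I want to cut out the keyword together with the two words on either side of it. For example, given the sentence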

let sentence = 'I was walking down the street when, suddenly, the TV came on.'

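and the keyword street, I want to extract down the street when suddenly from the sentence. Eventually I will remove all stop words (such as the), but for now I just want to extract everything, stop words included. I have been trying to do this with a regular expression, without success. Here is my code: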

let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,""); //Removes the commas and other punctuation that could interfere with my extraction
let removeSpace = removePunc.replace(/\s{2,}/g," ");  //Removes additional whitespace that's not required
const regex = new RegExp('([^\s]+\s[^\s]+\s' + keyword + '\s[^\s]+\s[^\s]+)', 'gs') //Here's where I was trying to get the two words on either side of the keyword, although it currently doesn't work
let keywordZone = regex.exec(removeSpace); //This is where the regex above should "cut out" the phrase I want

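I am not very comfortable with regular expressions, and I am a bit confused about why it does not work, because it seems to work for this particular example on this regex simulator.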

If I try it now, it does nothing. For example, with the sentence Lawmakers, flight attendants, passengers oppose TSA proposal to cut screening at airports first reported by CNN and the keyword proposal, nothing is matched at all.

Thanks in advance for your replies, much appreciated!

2 answers:

Answer 0: (score: 0)

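After removing the punctuation, you can split the sentence at each space and then pick the two elements before and after the keyword from the resulting array: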

let sentence = 'I was walking down the street when, suddenly, the TV came on.'
let keyword = "street";


let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,""); //Removes the commas and other punctuation that could interfere with my extraction

let wordArr = removePunc.split(" ");

let keyPos = wordArr.indexOf(keyword);

let newSentence = [wordArr[keyPos-2], wordArr[keyPos-1], wordArr[keyPos], wordArr[keyPos+1], wordArr[keyPos+2]].join(" ");

console.log(newSentence)

If you put this in a function, you can also easily test it on other strings:

function nearestFourWords(sentence, keyword) {
  // Remove commas and other punctuation that could interfere with the extraction
  let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "");

  let wordArr = removePunc.split(" ");

  let keyPos = wordArr.indexOf(keyword);

  let newSentence = [wordArr[keyPos - 2], wordArr[keyPos - 1], wordArr[keyPos], wordArr[keyPos + 1], wordArr[keyPos + 2]].join(" ");

  return newSentence;
}

const test1 = ["Lawmakers, flight attendants, passengers oppose TSA proposal to cut screening at airports first reported by CNN", "proposal"];

console.log(nearestFourWords(test1[0], test1[1]));
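One caveat with the indexing approach above: when the keyword is missing, indexOf returns -1, and when it sits near either end of the sentence some of the five lookups are undefined, which join quietly turns into empty strings. A minimal sketch of a bounds-safe variant follows. The name wordsAround and the radius parameter are hypothetical, not part of the original answer.

```javascript
// Hypothetical helper: returns up to `radius` words on each side of the keyword.
function wordsAround(sentence, keyword, radius = 2) {
  const words = sentence
    .replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, '') // same punctuation set as the answer
    .split(/\s+/)                                 // tolerate repeated whitespace
    .filter(Boolean);                             // drop empty strings from stray spaces

  const pos = words.indexOf(keyword);
  if (pos === -1) return null; // keyword is not present as a whole word

  // slice() clamps the end index itself; the start index is clamped by hand
  return words.slice(Math.max(0, pos - radius), pos + radius + 1).join(' ');
}

const demo = 'I was walking down the street when, suddenly, the TV came on.';
console.log(wordsAround(demo, 'street')); // "down the street when suddenly"
console.log(wordsAround(demo, 'I'));      // "I was walking"
console.log(wordsAround(demo, 'bus'));    // null
```

Returning null for a missing keyword is a design choice for this sketch; an empty string would work equally well if callers prefer it.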

If you later want to remove words such as the, just add the lines for that before the split!
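As a rough illustration of that stop-word step, the sketch below filters the word array instead of editing the string before the split, which is a swapped-in variation rather than what the answer literally describes. STOP_WORDS and removeStopWords are hypothetical names, and the word list is a tiny placeholder.

```javascript
// Hypothetical stop-word list; a real one would be much longer.
const STOP_WORDS = new Set(['the', 'a', 'an', 'to', 'of']);

// Drops stop words case-insensitively, but never drops the keyword itself.
function removeStopWords(words, keyword) {
  return words.filter(w => w === keyword || !STOP_WORDS.has(w.toLowerCase()));
}

const allWords = 'I was walking down the street when suddenly the TV came on'.split(' ');
const kept = removeStopWords(allWords, 'street');
const keptPos = kept.indexOf('street');
const context = kept.slice(Math.max(0, keptPos - 2), keptPos + 3).join(' ');

console.log(context); // "walking down street when suddenly"
```

Note that filtering first changes which words count as the two nearest neighbours: here walking moves into the window because the was removed.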

Answer 1: (score: 0)

If you want or need to use a regular expression, here is a simple approach.

const sentence = 'I was walking down the street when, suddenly, the TV came on.'
const keyword = 'street';
const regex = `\\w+\\W+\\w+\\W+${keyword}\\W+\\w+\\W+\\w+`;

console.log(sentence.match(regex));
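The doubled backslashes in the pattern string above matter, and they are a likely reason the pattern in the question matches nothing. Inside an ordinary string literal '\s' is just the letter s, so the RegExp constructor never sees a whitespace class. A regex tester that takes the pattern as a regex literal would not show this problem. The sketch below compares the two spellings on an already cleaned sentence; the variable names are illustrative only.

```javascript
const cleaned = 'I was walking down the street when suddenly the TV came on';
const word = 'street';

// Single backslashes: '\s' collapses to 's' inside a normal string literal.
const broken = new RegExp('([^\s]+\s[^\s]+\s' + word + '\s[^\s]+\s[^\s]+)');

// Doubled backslashes survive the string literal and reach the regex engine.
const fixed = new RegExp('([^\\s]+\\s[^\\s]+\\s' + word + '\\s[^\\s]+\\s[^\\s]+)');

console.log(broken.source);          // ([^s]+s[^s]+sstreets[^s]+s[^s]+)
console.log(broken.exec(cleaned));   // null
console.log(fixed.exec(cleaned)[0]); // "down the street when suddenly"
```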

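Refactoring this into a function quickly reveals a drawback: if the keyword is within two words of the start or end of the string, the search fails.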

const sentence = 'I was walking down the street when, suddenly, the TV came on.'

console.log({
  street: keywordSearch(sentence, 'street'),
  I: keywordSearch(sentence, 'I'),
  was: keywordSearch(sentence, 'was'),
  came: keywordSearch(sentence, 'came'),
  on: keywordSearch(sentence, 'on')
});

function keywordSearch(str, key) {
  const regex = `\\w+\\W+\\w+\\W+${key}\\W+\\w+\\W+\\w+`;
  
  return str.match(regex);
}

This can be mitigated by making the surrounding groups optional.

const sentence = 'I was walking down the street when, suddenly, the TV came on.'

console.log({
  street: keywordSearch(sentence, 'street'),
  I: keywordSearch(sentence, 'I'),
  was: keywordSearch(sentence, 'was'),
  came: keywordSearch(sentence, 'came'),
  on: keywordSearch(sentence, 'on')
});

function keywordSearch(str, key) {
  const regex = `(?:\\w+\\W+|)(?:\\w+\\W+|)${key}(?:\\W+\\w+|)(?:\\W+\\w+|)`;
  
  return str.match(regex);
}
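Two further points may be worth handling, depending on the input: the keyword is interpolated into the pattern unescaped, and without word boundaries a short keyword such as on can match inside a longer word. A minimal sketch under those assumptions follows. escapeRegExp and keywordContext are hypothetical names, and the \b anchors assume the keyword starts and ends with a word character.

```javascript
// Hypothetical helper: escape regex metacharacters in user-supplied text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Returns the keyword with up to two words of context on each side, or null.
function keywordContext(str, key) {
  const k = escapeRegExp(key);
  // \b keeps "on" from matching inside a longer word such as "onward"
  const regex = new RegExp(`(?:\\w+\\W+)?(?:\\w+\\W+)?\\b${k}\\b(?:\\W+\\w+)?(?:\\W+\\w+)?`);
  const match = str.match(regex);
  return match ? match[0] : null;
}

const text = 'I was walking down the street when, suddenly, the TV came on.';
console.log(keywordContext(text, 'street')); // "down the street when, suddenly"
console.log(keywordContext(text, 'on'));     // "TV came on"
console.log(keywordContext('Move onward, then turn on the light', 'on'));
// "then turn on the light"
```

As in the answer's version, punctuation between the matched words is kept in the result, so it would still need to be stripped afterwards if plain words are wanted.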

Hope this helps.