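I'm trying to build a keyword extractor in JavaScript that also captures some context. There are many steps, and most of them are simple, except for including the unimportant words that sit next to the keyword in the passage. I want to cut out the selected keyword together with the two words on either side of it. For example, if I have the sentence
let sentence = 'I was walking down the street when, suddenly, the TV came on.'
and the keyword is "street", I want to extract "down the street when suddenly" from the sentence. Eventually I'll remove all stopwords (such as "the"), but for now I just want to extract the phrase with the stopwords still in it. I've been trying to do this with a regular expression, without success. Here is my code: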
let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,""); //Removes the commas and other punctuation that could interfere with my extraction
let removeSpace = removePunc.replace(/\s{2,}/g," "); //Removes additional whitespace that's not required
const regex = new RegExp('([^\s]+\s[^\s]+\s' + keyword + '\s[^\s]+\s[^\s]+)', 'gs') //Here's where I was trying to get the two words on either side of the keyword, although it currently doesn't work
let keywordZone = regex.exec(removeSpace); //This is where the regex above should "cut out" the phrase I want
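I'm not very comfortable with regular expressions, and I'm a bit confused about why this doesn't work, because it seemed to work for a specific example on this regex simulator. When I try it now, it does nothing. For example, the sentence "Lawmakers, flight attendants, passengers oppose TSA proposal to cut screening at airports first reported by CNN" with the keyword "proposal" produces no match at all.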
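Thanks in advance for any replies, much appreciated!
Answer 0: (score: 0)
After removing the punctuation, you can split the sentence at each space and then pick the two elements before and after the keyword from the resulting array: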
let sentence = 'I was walking down the street when, suddenly, the TV came on.';
let keyword = "street";
let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, ""); // Remove commas and other punctuation that could interfere with the extraction
let wordArr = removePunc.split(/\s+/); // Split on runs of whitespace
let keyPos = wordArr.indexOf(keyword);
// Take up to two words on either side of the keyword; use an empty string if the keyword is absent
let newSentence = keyPos === -1 ? "" : wordArr.slice(Math.max(0, keyPos - 2), keyPos + 3).join(" ");
console.log(newSentence); // down the street when suddenly
Wrapping it in a function also makes it easy to test on other strings:
function nearestFourWords(sentence, keyword) {
  let removePunc = sentence.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, ""); // Remove commas and other punctuation that could interfere with the extraction
  let wordArr = removePunc.split(/\s+/); // Split on runs of whitespace
  let keyPos = wordArr.indexOf(keyword);
  if (keyPos === -1) return ""; // Keyword not found
  // Up to two words on either side of the keyword, clamped at the start of the sentence
  return wordArr.slice(Math.max(0, keyPos - 2), keyPos + 3).join(" ");
}
const test1 = ["Lawmakers, flight attendants, passengers oppose TSA proposal to cut screening at airports first reported by CNN", "proposal"];
console.log(nearestFourWords(test1[0], test1[1])); // oppose TSA proposal to cut
If you later want to remove words such as "the", just add those lines in before the split!
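As a rough sketch of that stopword step: the version below filters the extracted phrase instead of editing the sentence before the split, so the two-word window is still measured on the original sentence. The STOPWORDS list and the removeStopwords name are made up for this example; a real stopword list would be much longer.

```javascript
// Hypothetical stopword list, for illustration only.
const STOPWORDS = new Set(["the", "a", "an", "to", "at", "by"]);

// Returns the words of `phrase` that are not in STOPWORDS (case-insensitive).
function removeStopwords(phrase) {
  return phrase
    .split(/\s+/)
    .filter((word) => word !== "" && !STOPWORDS.has(word.toLowerCase()))
    .join(" ");
}

console.log(removeStopwords("down the street when suddenly")); // down street when suddenly
```

Filtering before the window is taken would instead pull in words from further away, so which order is right depends on what the context is meant to show.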
Answer 1: (score: 0)
If you want or need to use a regular expression, here is a simple approach.
const sentence = 'I was walking down the street when, suddenly, the TV came on.'
const keyword = 'street';
const regex = `\\w+\\W+\\w+\\W+${keyword}\\W+\\w+\\W+\\w+`;
console.log(sentence.match(regex));
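Note the doubled backslashes in the pattern string. The attempt in the question most likely matches nothing because it builds its pattern from ordinary string literals with single backslashes: inside a string, \s is not a recognized escape, so the backslash is dropped and the regex engine sees a plain s. A small illustration (the variable names here are made up for the example):

```javascript
// In a string literal, "\s" is not a recognized escape, so the backslash is dropped.
const single = '[^\s]+\s';    // the string is actually: [^s]+s
const doubled = '[^\\s]+\\s'; // the string is actually: [^\s]+\s

console.log(single);  // [^s]+s
console.log(doubled); // [^\s]+\s

const text = 'down the street';
// With the backslashes preserved, the pattern means "non-whitespace run, then whitespace".
console.log(new RegExp(doubled).exec(text)[0]); // "down "
// Without them, it means "run of characters other than s, then a literal s".
console.log(new RegExp(single).exec(text)[0]);  // "down the s"
```

A regex literal such as /[^\s]+\s/ does not have this problem, but it cannot contain a variable keyword, which is why the pattern is assembled from a string here.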
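Refactoring the search above into a function quickly reveals a drawback: the search fails if the keyword is within two words of the beginning or end of the string.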
const sentence = 'I was walking down the street when, suddenly, the TV came on.'
console.log({
street: keywordSearch(sentence, 'street'),
I: keywordSearch(sentence, 'I'),
was: keywordSearch(sentence, 'was'),
came: keywordSearch(sentence, 'came'),
on: keywordSearch(sentence, 'on')
});
function keywordSearch(str, key) {
const regex = `\\w+\\W+\\w+\\W+${key}\\W+\\w+\\W+\\w+`;
return str.match(regex);
}
This can be mitigated by using optional groups.
const sentence = 'I was walking down the street when, suddenly, the TV came on.'
console.log({
street: keywordSearch(sentence, 'street'),
I: keywordSearch(sentence, 'I'),
was: keywordSearch(sentence, 'was'),
came: keywordSearch(sentence, 'came'),
on: keywordSearch(sentence, 'on')
});
function keywordSearch(str, key) {
const regex = `(?:\\w+\\W+|)(?:\\w+\\W+|)${key}(?:\\W+\\w+|)(?:\\W+\\w+|)`;
return str.match(regex);
}
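One remaining caveat is that the keyword is inserted into the pattern as-is, so it can match inside a longer word (for example "alk" inside "walking"), and a keyword containing regex metacharacters would change the pattern. The following is a possible refinement, not part of the original answer: escapeRegExp and keywordSearchWholeWord are hypothetical helper names, and {0,2} is used as a shorter way of writing the two optional groups.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical variant of keywordSearch: matches the keyword only as a whole word
// and returns the matched phrase, or null if the keyword is not found.
function keywordSearchWholeWord(str, key) {
  const pattern = `(?:\\w+\\W+){0,2}\\b${escapeRegExp(key)}\\b(?:\\W+\\w+){0,2}`;
  const match = str.match(new RegExp(pattern));
  return match ? match[0] : null;
}

const example = 'I was walking down the street when, suddenly, the TV came on.';
console.log(keywordSearchWholeWord(example, 'street')); // down the street when, suddenly
console.log(keywordSearchWholeWord(example, 'on'));     // TV came on
console.log(keywordSearchWholeWord(example, 'alk'));    // null
```

The \b anchors assume the keyword starts and ends with a word character; a keyword that begins or ends with punctuation would need a different boundary check.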
Hope this helps.