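Some words are sometimes used as verbs and sometimes as other parts of speech.
Example
A sentence where the word is used as a verb:
I blame myself for what happened
A sentence where the word is used as a noun:
For what happened the blame is yours
I know in advance which word I want to detect; in the examples above it is "blame". I want to detect it and remove it as a stop word only when it is used as a verb.
Is there an easy way to do this?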
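Answer 0 (score: 3)
You can install TreeTagger and then use the koRpus package to call TreeTagger from R. Install it in a location such as C:\Treetagger.
I will first show how TreeTagger works, so that you can follow what happens in the actual solution further down in this answer: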
library(koRpus)
your_sentences <- c("I blame myself for what happened",
"For what happened the blame is yours")
text.tagged <- treetag(file="I blame myself for what happened",
format="obj", treetagger="manual", lang="en",
TT.options = list(path="C:\\Treetagger", preset="en") )
text.tagged@TT.res[, 1:2]
# token tag
#1 I PP
#2 blame VVP
#3 myself PP
#4 for IN
#5 what WP
#6 happened VVD
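Now that the sentence has been tagged, all that is left is to remove the occurrences of "blame" that are verbs.
I will do this sentence by sentence, by creating a function that first tags the sentence, then checks for "bad words" such as "blame" that are also tagged as verbs, and finally removes them from the sentence: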
remove_words <- function(sentence, badword="blame"){
tagged.text <- treetag(file=sentence, format="obj", treetagger="manual", lang="en",
TT.options=list(path="C:\\Treetagger", preset="en"))
# Check for bad words AND verb (TreeTagger verb tags start with "V"):
cond1 <- (tagged.text@TT.res$token == badword)
cond2 <- (substring(tagged.text@TT.res$tag, 1, 1) == "V")
redflag <- which(cond1 & cond2)
# If no such case, return sentence as is. If so, then remove that word:
if(length(redflag) == 0) return(sentence)
else{
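# Note: this assumes the tagger's tokens line up one-to-one with the
# space-separated words, which holds for sentences without punctuation.
splitsent <- strsplit(sentence, " ")[[1]]
splitsent <- splitsent[-redflag]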
return(paste0(splitsent, collapse=" "))
}
}
lapply(your_sentences, remove_words)
# [[1]]
# [1] "I myself for what happened"
# [[2]]
# [1] "For what happened the blame is yours"
Answer 1 (score: 2)
In Python:
from nltk import pos_tag
s1 = "I blame myself for what happened"
pos_tag(s1.split())
This gives you each word together with its part-of-speech tag.
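Once you have the tagged words, checking whether "blame" is used as a verb is a matter of looking at its tag. Here is a minimal sketch, assuming Penn Treebank style tags, where the verb tags all start with "VB". The helper name `verb_positions` is made up for this example, and the tag lists are written by hand in the (token, tag) shape that `pos_tag` returns, so the tags a real tagger assigns may differ.

```python
def verb_positions(tagged, word):
    """Return the indices at which `word` appears with a verb tag.

    `tagged` is a list of (token, tag) pairs. Penn Treebank verb tags
    (VB, VBD, VBG, VBN, VBP, VBZ) all start with "VB".
    """
    return [
        i
        for i, (token, tag) in enumerate(tagged)
        if token.lower() == word.lower() and tag.startswith("VB")
    ]


# Hand-written tags for illustration only (not real tagger output).
tagged_verb = [("I", "PRP"), ("blame", "VBP"), ("myself", "PRP"),
               ("for", "IN"), ("what", "WP"), ("happened", "VBD")]
tagged_noun = [("For", "IN"), ("what", "WP"), ("happened", "VBD"),
               ("the", "DT"), ("blame", "NN"), ("is", "VBZ"),
               ("yours", "PRP")]

print(verb_positions(tagged_verb, "blame"))  # [1]
print(verb_positions(tagged_noun, "blame"))  # []
```

An empty list means the word never occurs as a verb in that sentence, so the sentence can be left alone.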
Answer 2 (score: 2)
You can do something like this in Python:
>>> import nltk
>>> from nltk import word_tokenize
>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
Then add your own filter to drop the words tagged as verbs.
Hope this helps!
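The filter step could look roughly like the sketch below. It assumes Penn Treebank style tags and splits on whitespace, so punctuation stays attached to words. The names `remove_verb_uses`, `toy_tagger` and `TOY_TAGS` are invented for this example; `toy_tagger` is only a stand-in with hand-written tags so that the snippet runs without any tagger model installed.

```python
def remove_verb_uses(sentence, word, tagger):
    """Drop `word` from `sentence` wherever the tagger marks it as a verb.

    `tagger` is any callable that maps a list of tokens to a list of
    (token, tag) pairs, which is the shape nltk.pos_tag returns.
    """
    tokens = sentence.split()
    kept = [
        token
        for token, tag in tagger(tokens)
        if not (token.lower() == word.lower() and tag.startswith("VB"))
    ]
    return " ".join(kept)


# Stand-in tagger with hand-written tags (an assumption, not real output).
TOY_TAGS = {
    "I blame myself for what happened":
        ["PRP", "VBP", "PRP", "IN", "WP", "VBD"],
    "For what happened the blame is yours":
        ["IN", "WP", "VBD", "DT", "NN", "VBZ", "PRP"],
}


def toy_tagger(tokens):
    return list(zip(tokens, TOY_TAGS[" ".join(tokens)]))


for s in TOY_TAGS:
    print(remove_verb_uses(s, "blame", toy_tagger))
```

With NLTK installed and its tagger data downloaded, you would pass `nltk.pos_tag` in place of `toy_tagger`.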