Removing a word as a stop word only when it is used as a verb

Time: 2017-11-13 22:35:21

Tags: r nlp

Some words are sometimes used as verbs and sometimes as other parts of speech.

Example

A sentence in which the word is used as a verb:

I blame myself for what happened

A sentence in which the same word is used as a noun:

For what happened the blame is yours

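I know in advance which word I want to detect; in the examples above it is "blame". I want to detect it and remove it as a stop word only when it is used as a verb.

Is there a simple way to do this?

3 answers:

Answer 0 (score: 3)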

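You can install TreeTagger and then call it from R through the koRpus package. Install it in a location such as C:\Treetagger.

I will first show how TreeTagger works, so that you can follow what happens in the actual solution further down:

Introducing TreeTagger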

library(koRpus)

your_sentences <- c("I blame myself for what happened", 
                    "For what happened the blame is yours")

text.tagged <- treetag(file="I blame myself for what happened", 
                  format="obj", treetagger="manual", lang="en",
                  TT.options = list(path="C:\\Treetagger", preset="en") )
text.tagged@TT.res[, 1:2]
#       token tag    
#1         I  PP
#2     blame VVP 
#3    myself  PP 
#4       for  IN
#5      what  WP
#6  happened VVD 

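The sentence has now been analysed, and all that is left is to remove the occurrences of "blame" that are tagged as verbs.

Solution

I would go sentence by sentence, using a function that first tags the sentence, then checks for "bad words" such as "blame" that are also tagged as verbs, and finally removes them from the sentence: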

remove_words <- function(sentence, badword="blame"){
  tagged.text <- treetag(file=sentence, format="obj", treetagger="manual", lang="en", 
                         TT.options=list(path="C:\\Treetagger", preset="en"))
  # Check for bad words AND verb:
  cond1 <- (tagged.text@TT.res$token == badword)
  cond2 <- (substring(tagged.text@TT.res$tag, 1, 1) == "V")
  redflag <- which(cond1 & cond2)

  # If no such case, return sentence as is. If so, then remove that word:
  if(length(redflag) == 0) return(sentence)
  else{
    # Assumes the whitespace-split words line up one-to-one with the TreeTagger tokens
    splitsent <- strsplit(sentence, " ")[[1]]
    splitsent <- splitsent[-redflag]
    return(paste0(splitsent, collapse=" "))
  }
}

lapply(your_sentences, remove_words)
# [[1]]
# [1] "I myself for what happened"
# [[2]]
# [1] "For what happened the blame is yours"

Answer 1 (score: 2)

In Python:

from nltk import pos_tag
s1 = "I blame myself for what happened"
pos_tag(s1.split())

This gives you each word together with its part-of-speech tag.
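
To make the tags usable for the question, here is a minimal sketch of how the (word, tag) pairs could be inspected. It assumes Penn Treebank tags, which NLTK's default English tagger produces and in which verb tags start with "VB". The helper name verb_positions and the hand-written tag lists are only illustrations, not real tagger output.

```python
# Hypothetical helper: find where `word` occurs with a verb tag.
def verb_positions(tagged, word):
    """Return the indices of `word` in `tagged` whose tag starts with 'VB'."""
    return [
        i for i, (token, tag) in enumerate(tagged)
        if token.lower() == word.lower() and tag.startswith("VB")
    ]


# Hand-written (token, tag) pairs for illustration, not real tagger output.
verb_use = [("I", "PRP"), ("blame", "VBP"), ("myself", "PRP"),
            ("for", "IN"), ("what", "WP"), ("happened", "VBD")]
noun_use = [("For", "IN"), ("what", "WP"), ("happened", "VBD"),
            ("the", "DT"), ("blame", "NN"), ("is", "VBZ"), ("yours", "PRP")]

print(verb_positions(verb_use, "blame"))  # [1]
print(verb_positions(noun_use, "blame"))  # []
```

With NLTK installed, the list returned by pos_tag(s1.split()) can be passed as the first argument in place of the hand-written lists.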

Answer 2 (score: 2)

You can do something like this in Python:

>>> import nltk
>>> from nltk import word_tokenize
>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]

Then add your own filter to remove the verbs.

Hope this helps!
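
One way such a filter might look is sketched below. It assumes the tagger returns Penn Treebank tags, so a token is treated as a verb when its tag starts with "VB". The names remove_verb_stopwords and toy_tagger are made up for this sketch; toy_tagger is only a stand-in so the example runs without NLTK, and it knows nothing beyond the word "blame".

```python
def remove_verb_stopwords(sentence, stopwords, tagger):
    """Drop words from `stopwords` only where `tagger` marks them as verbs.

    `tagger` is any callable that maps a list of tokens to (token, tag)
    pairs with Penn Treebank tags (an assumption of this sketch).
    """
    tokens = sentence.split()
    blocked = {w.lower() for w in stopwords}
    kept = [
        token for token, tag in tagger(tokens)
        if not (token.lower() in blocked and tag.startswith("VB"))
    ]
    return " ".join(kept)


# Stand-in tagger for illustration: tags "blame" as a verb only right after "I".
def toy_tagger(tokens):
    tagged = []
    for i, token in enumerate(tokens):
        after_pronoun = i > 0 and tokens[i - 1] == "I"
        tag = "VBP" if token == "blame" and after_pronoun else "NN"
        tagged.append((token, tag))
    return tagged


print(remove_verb_stopwords("I blame myself for what happened", ["blame"], toy_tagger))
print(remove_verb_stopwords("For what happened the blame is yours", ["blame"], toy_tagger))
```

In real use, nltk.pos_tag can be passed as the tagger argument, since it takes a list of tokens and returns (token, tag) pairs. Splitting on whitespace is a simplification; word_tokenize would handle punctuation better.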