Counting the number of verbs in each speech in an R data frame

Asked: 2017-10-29 18:53:36

Tags: r nlp text-mining tm opennlp

I have a data frame like this:

str(data)
'data.frame':   255 obs. of  3 variables:
$ Group      : Factor w/ 255 levels "AlzGroup1","AlzGroup10",..: 1 112 179 190 201 212 223 234 245 2 ...
$ Gender     : int  1 1 0 0 0 0 0 1 0 0 ...
$ Description: Factor w/ 255 levels "A boy's on the uh falling off the stool picking up cookies . The girl's reaching up for it . The girl the lady "| __truncated__,..: 63 69 38 134 111 242 196 85 84 233 ...

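The Description column holds 255 speeches, and I want to add a column to the data frame containing the number of verbs in each speech. I know how to get the number of verbs, but the code below gives the total number of verbs across the whole Description column: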

library(NLP)
library(tm)
library(openNLP)
NumOfVerbs=sapply(strsplit(as.character(tagPOS(data$Description)),"[[:punct:]]*/VB.?"),function(x) {res = sub("(^.*\\s)(\\w+$)", "\\2", x); res[!grepl("\\s",res)]} )

Does anyone know how to get the number of verbs in each speech?

Thanks for your help!

Elahe

1 Answer:

Answer 0: (score: 0)

Assuming you are using a function similar to this one (found here: could not find function tagPOS):

tagPOS <-  function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}

Create a function that counts the number of POS tags containing the letters 'VB':

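count_verbs <- function(x) {
  # POS tags for every word in x, as returned by tagPOS() above
  pos_tags <- tagPOS(x)$POStags
  # Count the tags that contain "VB" (VB, VBD, VBG, ...)
  sum(grepl("VB", pos_tags))
}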
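
To see what the counting step does on its own, here is a minimal sketch on a hand-written tag vector. The tags below are made up for illustration and are not actual openNLP output. It assumes Penn Treebank style tags, where the verb tags (VB, VBD, VBG, VBN, VBP, VBZ) all start with "VB", so the pattern is anchored with ^ to avoid matching "VB" elsewhere in a tag:

```r
# Illustrative tags only (assumed Penn Treebank style), not real tagger output
pos_tags <- c("DT", "NN", "VBZ", "VBG", "IN", "DT", "NN")

# grepl() returns one TRUE/FALSE per tag; sum() counts the TRUEs
num_verbs <- sum(grepl("^VB", pos_tags))
print(num_verbs)
```

For tags like these the anchored and unanchored patterns should give the same count; the anchor just makes the intent explicit.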

Then use dplyr to group by Group and summarise with count_verbs():

library(dplyr)
data %>% 
  group_by(Group) %>%
  summarise(num_verbs = count_verbs(Description))
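
Since the question asks for a new column rather than a summary table, a row-wise variant may also be useful. The following is only a sketch: add_verb_counts() and toy_counter() are hypothetical names introduced here, and toy_counter() is a stand-in so the example runs without openNLP. It assumes each row of the data frame holds exactly one speech; with the real data you would pass count_verbs as the counter instead.

```r
# Hypothetical helper: applies a per-text counter to every row of `df`
# and stores the result in a new column `num_verbs`.
add_verb_counts <- function(df, count_fun) {
  texts <- as.character(df$Description)  # factor -> plain strings
  df$num_verbs <- vapply(
    texts,
    function(txt) as.integer(count_fun(txt)),
    integer(1),
    USE.NAMES = FALSE
  )
  df
}

# Stand-in counter (NOT a POS tagger): counts whitespace-separated
# tokens ending in "ing", only so the sketch is self-contained.
toy_counter <- function(txt) {
  tokens <- strsplit(txt, "\\s+")[[1]]
  sum(grepl("ing$", tokens))
}

toy <- data.frame(
  Group = c("AlzGroup1", "AlzGroup2"),
  Description = c("the boy is falling", "the girl is reaching and laughing"),
  stringsAsFactors = TRUE
)

toy <- add_verb_counts(toy, toy_counter)
print(toy)
```

With the functions from this answer loaded, the call would be add_verb_counts(data, count_verbs). Processing one text at a time keeps the tagger from merging the whole column into a single string, which appears to be why the code in the question returns a single total.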