Using mutate to get the number of ngrams

Date: 2016-04-15 09:26:27

Tags: r nlp dplyr stringr

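I am using dplyr to parse a column containing sentences and to count the number of ngrams in each sentence. Here is an example that shows the problem I am running into.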

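As you can see, one would expect ngram_cnt to be 3 and 4, but the result is a column containing 3, 3. The problem is that the code returns the number of ngrams in the first sentence and ignores the rest. You can try adding more sentences, with the same effect. What am I doing wrong?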

library(NLP)
library(dplyr)
library(stringr)

phrases <- c("this is the first", "and then comes the second")
df <- data.frame(phrase = phrases, id = c(1, 2))
df %>% mutate(ngram_cnt = length(ngrams(str_split(phrase, "\\s")[[1]], 2)))

If I run this instead,

phrases <- c("this is the first", "and then comes the second",
             "and the third which is even longer")
df <- data.frame(phrase = phrases, id = c(1, 2, 3))
df %>% mutate(ngram_cnt = str_length(phrase))

then I get the expected result (that is, the length of each sentence).

1 answer:

Answer 0 (score: 2)

That is because in

df %>% mutate(ngram_cnt = length(ngrams(str_split(phrase, "\\s")[[1]], 2)))

the [[1]] selects only the split of the first sentence, which is the same as:
length(ngrams(str_split(phrases, "\\s")[[1]], 2))
# [1] 3

mutate then puts that single value, 3, into every row.
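To see why only the first sentence is counted, it may help to inspect what str_split returns for a vector of phrases. The snippet below is a small sketch using the same two phrases; the variable names pieces and word_counts are illustrative only and are not part of the original code.

```r
library(stringr)

phrases <- c("this is the first", "and then comes the second")

# str_split() on a character vector returns a list with one element per phrase
pieces <- str_split(phrases, "\\s")
length(pieces)   # number of phrases, not number of words

# [[1]] therefore extracts only the words of the first phrase
pieces[[1]]

# lengths() (base R) reports the word count of every phrase at once
word_counts <- lengths(pieces)
word_counts
```

Because length(...) applied to the [[1]] element is a single number, mutate recycles it across all rows instead of computing one value per row.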

With this setup:
phrases <- c("this is the first", "and then comes the second")
df <- data.frame(phrase = phrases, id = c(1, 2))
library("dplyr")

You can use rowwise to perform the calculation row by row:

df %>% rowwise() %>% mutate(ngram_cnt = length(ngrams(str_split(phrase, "\\s")[[1]], n = 2)))
# Source: local data frame [2 x 3]
# Groups: <by row>
# 
#                      phrase    id ngram_cnt
#                      (fctr) (dbl)     (int)
# 1         this is the first     1         3
# 2 and then comes the second     2         4

If your IDs are unique, you can use group_by instead:

df %>% group_by(id) %>% mutate(ngram_cnt = length(ngrams(str_split(phrase, "\\s")[[1]], n = 2)))
# Source: local data frame [2 x 3]
# Groups: id [2]
# 
#                      phrase    id ngram_cnt
#                      (fctr) (dbl)     (int)
# 1         this is the first     1         3
# 2 and then comes the second     2         4

Or you can vectorize the function that computes the number of ngrams:

length_ngrams <- function(x) {
  length(ngrams(str_split(x, "\\s")[[1]], n = 2))
}
length_ngrams <- Vectorize(length_ngrams)
df %>% mutate(ngram_cnt = length_ngrams(phrase))
#                      phrase id ngram_cnt
# 1         this is the first  1         3
# 2 and then comes the second  2         4
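
As a further variation on the vectorized approach, one could iterate over the list that str_split returns rather than wrapping the function with Vectorize. This is only a sketch: count_ngrams is a hypothetical helper name, and stringsAsFactors = FALSE is an added assumption so that phrase is a plain character column.

```r
library(NLP)
library(dplyr)
library(stringr)

phrases <- c("this is the first", "and then comes the second")
df <- data.frame(phrase = phrases, id = c(1, 2), stringsAsFactors = FALSE)

# Hypothetical helper: split every phrase, then count the ngrams of each
# split separately, so the result has one integer per input phrase.
count_ngrams <- function(x, n = 2L) {
  vapply(
    str_split(x, "\\s"),
    function(words) length(ngrams(words, n)),
    integer(1)
  )
}

result <- df %>% mutate(ngram_cnt = count_ngrams(phrase))
result
```

Since count_ngrams already returns a vector as long as its input, it can be used inside mutate without rowwise or group_by.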