Word frequency counter in R using n-grams

Time: 2016-04-08 09:13:25

Tags: r frequency text-mining word

I want to perform an operation that transforms the input data into the output format shown here.

Input

colA                             colB
textA textB textC textD           m
textA textB                       n
textB textC                       p
textB textC textD                 q
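
For reproducibility, the input table can be written as an R data frame. This is a minimal sketch, assuming both columns are plain character vectors and that the object is called `df`, which is the name the answer's code refers to.

```r
# Assumed reconstruction of the input table; `df` is the name used in the answer.
df <- data.frame(
  colA = c("textA textB textC textD",
           "textA textB",
           "textB textC",
           "textB textC textD"),
  colB = c("m", "n", "p", "q"),
  stringsAsFactors = FALSE  # keep text as character so strsplit() accepts it
)
```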

Output

type    col_a              col_b(frequency)           col_c
unigram textA                        2                  m+n
unigram textB                        4                m+n+p+q
unigram textC                        3                 m+p+q
unigram textD                        2                  m+q
bigram  textA textB                  2                  m+n
bigram  textB textC                  3                 m+p+q
bigram  textC textD                  2                  m+q
trigram textA textB textC            1                   m
trigram textB textC textD            2                   m+q
fourgram textA textB textC textD     1                   m

This needs to be done separately for unigrams, bigrams, trigrams and fourgrams, and the results then combined with rbind.

1 answer:

Answer 0: (score: 3)

Here is an idea:

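n_grams <- function(n) {
  # Note: reads the data frame `df` from the calling environment
  # Distinct tokens across all rows, in order of first appearance
  unigrams1 <- unique(unlist(strsplit(as.character(df$colA), ' ', fixed = TRUE)))
  # Candidate n-grams: every n-token combination, joined by a space
  t <- apply(combn(unigrams1, n), 2, paste, collapse = ' ')
  # For each candidate, collect the colB labels of the rows that contain it
  t1 <- sapply(t, function(i) paste(df$colB[grepl(i, df$colA, fixed = TRUE)], collapse = '+'))
  # Drop candidates that occur in no row
  return(t1[nchar(t1) > 0])
}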
#testing the function

n_grams(1)
#    textA     textB     textC     textD 
#    "m+n" "m+n+p+q"   "m+p+q"     "m+q" 
n_grams(2)
#textA textB textB textC textC textD 
#      "m+n"     "m+p+q"       "m+q" 
n_grams(3)
#textA textB textC textB textC textD 
#              "m"             "m+q" 
n_grams(4)
#textA textB textC textD 
#                    "m" 

Then, to build the desired output:

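# Stack the four named vectors into one data frame with columns `values` and `ind`
df1 <- rbind(stack(n_grams(1)), stack(n_grams(2)), stack(n_grams(3)), stack(n_grams(4)))
# Frequency = number of '+'-separated labels (also correct for multi-character labels)
df1$freq <- lengths(strsplit(as.character(df1$values), '+', fixed = TRUE))
df1 <- df1[, c('ind', 'freq', 'values')]
df1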
#                       ind freq  values
#1                    textA    2     m+n
#2                    textB    4 m+n+p+q
#3                    textC    3   m+p+q
#4                    textD    2     m+q
#5              textA textB    2     m+n
#6              textB textC    3   m+p+q
#7              textC textD    2     m+q
#8        textA textB textC    1       m
#9        textB textC textD    2     m+q
#10 textA textB textC textD    1       m
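
The function above builds its candidates with `combn` from the distinct tokens, so it only finds n-grams whose tokens occur in first-appearance order, and it matches them as substrings. A possible alternative, sketched here under assumptions, slides a window over each row's tokens and also adds the `type` column from the question. The helper names `row_ngrams` and `ngram_table` and the output column names are hypothetical, not part of the original answer.

```r
# Assumed input, mirroring the table in the question
df <- data.frame(
  colA = c("textA textB textC textD", "textA textB",
           "textB textC", "textB textC textD"),
  colB = c("m", "n", "p", "q"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: contiguous n-grams of a single token vector
row_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(s) paste(tokens[s:(s + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: one row per distinct n-gram, with the number of
# input rows containing it and the '+'-joined colB labels of those rows
ngram_table <- function(df, n, type) {
  tokens <- strsplit(df$colA, " ", fixed = TRUE)
  # unique() so an n-gram repeated within one row is counted once for that row
  grams <- lapply(tokens, function(tk) unique(row_ngrams(tk, n)))
  ngram <- unlist(grams)
  label <- rep(df$colB, lengths(grams))
  # Group labels by n-gram, keeping n-grams in order of first appearance
  by_gram <- split(label, factor(ngram, levels = unique(ngram)))
  data.frame(
    type   = rep(type, length(by_gram)),
    ngram  = names(by_gram),
    freq   = unname(lengths(by_gram)),
    labels = unname(vapply(by_gram, paste, character(1), collapse = "+")),
    stringsAsFactors = FALSE
  )
}

types  <- c("unigram", "bigram", "trigram", "fourgram")
result <- do.call(rbind, lapply(1:4, function(n) ngram_table(df, n, types[n])))
print(result)
```

On the sample input this yields the same ten n-grams, frequencies and label strings as `df1` above. The sketch assumes every n from 1 to 4 occurs in at least one row; an n with no matches is not handled.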