Text mining - counting the frequency of phrases (multiple words)

Date: 2017-04-19 12:53:26

Tags: r nlp text-mining n-gram

I am familiar with using the tm library to create a TDM and count term frequencies.

But these terms are all single words.

How can I count the number of times a multi-word phrase occurs in a document and/or corpus?

EDIT:

I am adding my current code to improve/clarify my post.

This is fairly standard code for building a term-document matrix:

library(tm)


cname <- "C:/Users/George/Google Drive/R Templates/Gospels corpus"

corpus <- Corpus(DirSource(cname))

#Cleaning (base functions such as tolower should be wrapped in content_transformer)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c("a","the","an","that","and"))

#Convert back to plain text documents
corpus <- tm_map(corpus, PlainTextDocument)

#Create a term document matrix
tdm1 <- TermDocumentMatrix(corpus)

m1 <- as.matrix(tdm1)
word.freq <- sort(rowSums(m1), decreasing=T)
word.freq<-word.freq[1:100]

The problem is that this returns a matrix of single-word terms, for example:

  all      into      have      from      were       one      came       say       out 
  397       390       385       383       350       348       345       332       321

I want to be able to search the corpus for multi-word phrases, for example "came from" rather than just "came" and "from" separately.
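One common route is to give TermDocumentMatrix a custom tokenizer through its control list, so the same tm pipeline produces bigram terms instead of single words. A minimal sketch with a base-R tokenizer; the function name BigramTokenizer is my own, and the tm call at the end is an untested assumption about the corpus built above:

```r
# Split a document into words, then pair each word with its successor (base R only).
BigramTokenizer <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  words <- words[nzchar(words)]                     # drop empty tokens
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))           # "w1 w2", "w2 w3", ...
}

BigramTokenizer("he came from galilee")
# c("he came", "came from", "from galilee")

# With tm, pass it through the control list (sketch, assumes the corpus above):
# tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
```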

Thank you.

3 Answers:

Answer 0 (score 0):

Given the text:

text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."

To find the frequency of each word:

table(unlist(strsplit(text, ' ')))


   -      (and       and     count   example frequency         I        is    little        my 
    3         1         2         2         2         2         2         3         2         3 
   of      of).   patter.   pattern         R      some      text       the      This        to 
    2         1         1         1         2         2         2         2         2         2 
 want 
    2 

For the frequency of a specific pattern, count all of its matches with gregexpr (the \\b anchors restrict the match to the whole word):

length(gregexpr('\\bis\\b', text)[[1]])

[1] 3
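The same match-counting approach extends directly to multi-word phrases. A small base-R sketch; the helper name count_phrase and the phrase "is my" are my own choices, not from the answer:

```r
# Count how often a fixed multi-word phrase occurs in a string (base R only).
count_phrase <- function(text, phrase) {
  m <- gregexpr(phrase, text, fixed = TRUE)[[1]]  # start positions of every match
  if (m[1] == -1) 0L else length(m)               # gregexpr returns -1 when no match
}

text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."
count_phrase(text, "is my")   # 2 - the phrase occurs twice
```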

Answer 1 (score 0):

I created the following function for obtaining word n-grams and their corresponding frequencies:

library(tau) 
library(data.table)
# given a string vector and an n-gram size, this function returns word n-grams with corresponding frequencies
createNgram <-function(stringVector, ngramSize){

  ngram <- data.table()

  ng <- textcnt(stringVector, method = "string", n=ngramSize, tolower = FALSE)

  if(ngramSize==1){
    ngram <- data.table(w1 = names(ng), freq = unclass(ng), length=nchar(names(ng)))  
  }
  else {
    ngram <- data.table(w1w2 = names(ng), freq = unclass(ng), length=nchar(names(ng)))
  }
  return(ngram)
}

Given a string like:

text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."

Here is how to call the function: pass 2 for pairs of words, or 3 for phrases of length 3:

res <- createNgram(text, 2)

Printing res gives the output:

           w1w2      freq   length
 1:        I want    2      6
 2:        R text    2      6
 3:       This is    2      7
 4:         and I    2      5
 5:        and is    1      6
 6:     count the    2      9
 7:   example and    2     11
 8:  frequency of    2     12
 9:         is my    3      5
10:      little R    2      8
11:     my little    2      9
12:         my of    1      5
13:       of This    1      7
14:       of some    2      7
15:   pattern and    1     11
16:   some patter    1     11
17:  some pattern    1     12
18:  text example    2     12
19: the frequency    2     13
20:      to count    2      8
21:       want to    2      7
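If the tau package is not available, the core of createNgram can be reproduced in base R. A hedged sketch; the function name count_ngrams is mine, and it returns a plain table rather than a data.table:

```r
# Build word n-grams from a string and tabulate their frequencies (base R only).
count_ngrams <- function(s, n) {
  words <- strsplit(s, "\\s+")[[1]]
  if (length(words) < n) return(table(character(0)))
  grams <- vapply(seq_len(length(words) - n + 1),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)   # most frequent n-grams first
}

count_ngrams("to be or not to be", 2)
# "to be" appears twice; every other bigram once
```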

Answer 2 (score 0):

Here is a good example of code using tidytext: https://www.kaggle.com/therohk/news-headline-bigrams-frequency-vs-tf-idf

The same technique can be extended to larger values of n.

library(dplyr)
library(tidytext)
library(ggplot2)

bigram_tf_idf <- bigrams %>%
  count(year, bigram) %>%
  filter(n > 2) %>%
  bind_tf_idf(bigram, year, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf.plot <- bigram_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  filter(tf_idf > 0) %>%
  mutate(bigram = factor(bigram, levels = rev(unique(bigram))))

bigram_tf_idf.plot %>% 
  group_by(year) %>% 
  top_n(10) %>% 
  ungroup %>%
  ggplot(aes(bigram, tf_idf, fill = year)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~year, ncol = 3, scales = "free") +
  theme(text = element_text(size = 10)) +
  coord_flip()
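The snippet above assumes a bigrams data frame with year and bigram columns (e.g. produced by tidytext::unnest_tokens). A base-R sketch of the same per-group bigram counting, using a small made-up docs data frame (the data and the helper make_bigrams are mine, purely illustrative):

```r
# Hypothetical headlines grouped by year.
docs <- data.frame(
  year = c(2016, 2016, 2017),
  text = c("stock market rally", "stock market slump", "stock market rally"),
  stringsAsFactors = FALSE
)

# Turn one string into its bigrams.
make_bigrams <- function(s) {
  w <- strsplit(tolower(s), "\\s+")[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}

# One row per (year, bigram) occurrence.
bigrams <- do.call(rbind, lapply(seq_len(nrow(docs)), function(i) {
  data.frame(year = docs$year[i], bigram = make_bigrams(docs$text[i]),
             stringsAsFactors = FALSE)
}))

# Base-R equivalent of count(year, bigram): frequency of each bigram per year.
counts <- aggregate(list(n = bigrams$bigram),
                    by = list(year = bigrams$year, bigram = bigrams$bigram),
                    FUN = length)
counts
```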