我熟悉使用tm库来创建tdm并计算术语的频率。
但这些术语都是单字。
如何计算文字和/或语料库中出现多字短语的次数?
编辑:
我正在添加我现在的代码以改进/澄清我的帖子。
这是构建术语 - 文档矩阵的非常标准的代码:
library(tm)
cname <- ("C:/Users/George/Google Drive/R Templates/Gospels corpus")
corpus <- Corpus(DirSource(cname))
#Cleaning
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c("a","the","an","that","and"))
#convert to a plain text file
corpus <- tm_map(corpus, PlainTextDocument)
#Create a term document matrix
tdm1 <- TermDocumentMatrix(corpus)
m1 <- as.matrix(tdm1)
word.freq <- sort(rowSums(m1), decreasing=T)
word.freq<-word.freq[1:100]
问题是这会返回单个词术语的矩阵,例如:
all into have from were one came say out
397 390 385 383 350 348 345 332 321
我希望能够在语料库中搜索多字词。例如&#34;来自&#34;而不只是&#34;来了#34; &#34;来自&#34;分开。
谢谢。
答案 0 :(得分:0)
鉴于文字:
text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."
查找单词的频率:
table(strsplit(text, ' '))
- (and and count example frequency I is little my
3 1 2 2 2 2 2 3 2 3
of of). patter. pattern R some text the This to
2 1 1 1 2 2 2 2 2 2
want
2
对于模式的频率:
attr(regexpr('is', text), "match.length")
[1] 3
答案 1 :(得分:0)
我创建了以下用于获取单词n-gram及其相应频率的函数
library(tau)
library(data.table)
# given a string vector and size of ngrams this function returns word ngrams with corresponding frequencies
createNgram <-function(stringVector, ngramSize){
ngram <- data.table()
ng <- textcnt(stringVector, method = "string", n=ngramSize, tolower = FALSE)
if(ngramSize==1){
ngram <- data.table(w1 = names(ng), freq = unclass(ng), length=nchar(names(ng)))
}
else {
ngram <- data.table(w1w2 = names(ng), freq = unclass(ng), length=nchar(names(ng)))
}
return(ngram)
}
给出类似
的字符串text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."
以下是如何为一对单词调用函数,对于长度为3的短语3作为参数
res <- createNgram(text, 2)
打印res
输出
w1w2 freq length
1: I want 2 6
2: R text 2 6
3: This is 2 7
4: and I 2 5
5: and is 1 6
6: count the 2 9
7: example and 2 11
8: frequency of 2 12
9: is my 3 5
10: little R 2 8
11: my little 2 9
12: my of 1 5
13: of This 1 7
14: of some 2 7
15: pattern and 1 11
16: some patter 1 11
17: some pattern 1 12
18: text example 2 12
19: the frequency 2 13
20: to count 2 8
21: want to 2 7
答案 2 :(得分:0)
以下是使用Tidytext的代码的一个很好的示例:https://www.kaggle.com/therohk/news-headline-bigrams-frequency-vs-tf-idf
相同的技术可以扩展到更大的n值。
bigram_tf_idf <- bigrams %>%
count(year, bigram) %>%
filter(n > 2) %>%
bind_tf_idf(bigram, year, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf.plot <- bigram_tf_idf %>%
arrange(desc(tf_idf)) %>%
filter(tf_idf > 0) %>%
mutate(bigram = factor(bigram, levels = rev(unique(bigram))))
bigram_tf_idf.plot %>%
group_by(year) %>%
top_n(10) %>%
ungroup %>%
ggplot(aes(bigram, tf_idf, fill = year)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~year, ncol = 3, scales = "free") +
theme(text = element_text(size = 10)) +
coord_flip()