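I have a lot of tweets as text.
I want to know the frequency of the words that follow a specific word. For example, I have these tweets and I want the frequency of the words after "love":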
My love is...
My love is...
the love was...
the love were...
to get this result:
word   next word   frequency
Love   is          2
Love   was         1
Love   were        1
or, for all words:
word   next word   frequency
My     Love        2
the    love        2
Love   is          2
Love   was         1
Love   were        1
Answer 0: (score: 2)
The following procedure may help.
Step 1 (optional): create some example data
example <- c("my love is","my love is","banana","apple","the love was","the love were")
This vector looks like
"my love is" "my love is" "banana" "apple" "the love was" "the love were"
Step 2: get all vector entries that contain the word "love"
ex2 <- example[grep("love",example)]
which gives you
"my love is" "my love is" "the love was" "the love were"
Step 3: build a table of the words that appear after "love"
ex3 <- table(gsub(".*love\\s*","",ex2))
which gives you
is was were
2 1 1
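The three steps above could be wrapped into one helper, roughly as sketched below. This is only a sketch and not part of the answer: the name `next_word_freq` is made up for illustration, and it assumes plain English text, lower-cases everything, and matches "love" only as a whole word (so "lovely" or "glove" would not count, unlike the `grep`/`gsub` version).

```r
# Hypothetical helper (not from the answer above): counts the word that
# immediately follows `target` in each text, ignoring case.
next_word_freq <- function(texts, target) {
  # Lower-case, then split on anything that is not a letter or apostrophe
  tokens <- strsplit(tolower(texts), "[^a-z']+")
  followers <- unlist(lapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]            # drop empty tokens from leading punctuation
    idx <- which(tok == tolower(target))
    idx <- idx[idx < length(tok)]      # a target at the very end has no follower
    tok[idx + 1]
  }))
  sort(table(followers), decreasing = TRUE)
}

example <- c("my love is", "my love is", "banana", "apple",
             "the love was", "the love were")
next_word_freq(example, "love")
```

Because it works on tokens rather than on the remainder of the string, it returns only the single next word even when a tweet continues after it.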
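Answer 1: (score: 2)
Since you are dealing with several word combinations (first word x second word), I don't see any way to avoid using a loop. The code below should do what you want: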
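# Example tweets; each must split into the same number of words for matrix()
phrase <- c("My love is... ","My love is...","A love was...","the dogs were...")
# One row per tweet, one column per word position
SPLIT <- matrix(unlist(strsplit(phrase," ")),nrow=length(phrase),byrow=TRUE)
# Every unique (first word, second word) combination, with a placeholder count
vect <- as.data.frame(cbind(unique(expand.grid(SPLIT[,1],SPLIT[,2])),freq=NA))
to.find <- paste(vect[,1],vect[,2],sep=" ")
# Count the tweets that contain each combination as a literal string
for (i in seq_along(to.find)) {
  vect[i,3] <- length(grep(to.find[i],phrase,fixed=TRUE))}
# Keep only the combinations that actually occur
vect <- subset(vect,freq>0)
vect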
This prints:
Var1 Var2 freq
1 My love 2
3 A love 1
16 the dogs 1
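For the "all words" case in the question, one possible alternative is to pair each word with its successor and let `table()` do the counting, which avoids the explicit `for` loop over candidate pairs. The sketch below is an assumption on my part rather than the code from this answer: `bigram_freq` is a made-up name, the text is lower-cased, punctuation is dropped, and at least one tweet is assumed to contain two or more words.

```r
# Sketch of an alternative (not the answer's code): count every adjacent
# word pair across all texts.
bigram_freq <- function(texts) {
  # Lower-case, split on non-letters, and drop empty tokens
  tokens <- lapply(strsplit(tolower(texts), "[^a-z']+"),
                   function(tok) tok[nzchar(tok)])
  tokens <- tokens[lengths(tokens) >= 2]   # a pair needs at least two words
  first  <- unlist(lapply(tokens, function(tok) tok[-length(tok)]))
  second <- unlist(lapply(tokens, function(tok) tok[-1]))
  # Cross-tabulate (word, next word) and flatten to one row per pair
  counts <- as.data.frame(table(word = first, next_word = second),
                          responseName = "frequency",
                          stringsAsFactors = FALSE)
  counts <- counts[counts$frequency > 0, ]
  counts[order(-counts$frequency), ]
}

tweets <- c("My love is...", "My love is...",
            "the love was...", "the love were...")
bigram_freq(tweets)
```

Unlike the matrix approach, this does not require every tweet to have the same number of words, and it counts pairs at any position in the tweet, not only the first two words.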