Getting word occurrences

Time: 2014-04-10 18:43:32

Tags: r machine-learning

I am trying to use R to get the number of occurrences of each word in a CSV file. My dataset looks like this:

                                        TITLE
1                                       My first Android app after a year
2                                 Unmanned drone buzzes French police car
3                                       Make anything editable with HTML5
4                                          Predictive vs Reactive control
5 What was it like to move to San Antonio and go through TechStars Cloud?
6               Health-care sector vulnerable to hackers, researchers say

I tried using the function from "Machine Learning for Hackers":

get.tdm <- function(doc.vec) {
  doc.corpus <- Corpus(VectorSource(doc.vec))
  control <- list(stopwords=TRUE, removePunctuation=TRUE, removeNumbers=TRUE, minDocFreq=2)
  doc.dtm <- TermDocumentMatrix(doc.corpus, control)
  return(doc.dtm)
}

But I get an error that I don't understand:

Error: is.Source(s) is not TRUE
In addition: Warning message:
In is.Source(s) : vectorized sources must have a positive length entry

What could be going wrong?

1 answer:

Answer 0 (score: 1)

This worked for me (calling your data frame df):

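library(tm)
# Passing the whole data frame makes each column one document,
# so doc.corpus[[1]] holds the entire TITLE column.
doc.corpus <- Corpus(VectorSource(df))
# termFreq() counts how often each term occurs in that single document.
freq <- data.frame(count=termFreq(doc.corpus[[1]]))
freq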
#             count
# after           1
# and             1
# android         1
# antonio         1
# anything        1
# ...
# unmanned        1
# vulnerable      1
# was             1
# what            1
# with            1
# year            1
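
As for the error itself: the question does not show how get.tdm was called, so this is only a guess, but the warning text suggests the vector passed in was empty. For example, a misspelled column name such as df$Title instead of df$TITLE returns NULL. Below is a minimal base-R sketch (no tm involved) that checks for that case first and then counts words, which can serve as a cross-check on the tm result. The data frame and the function name count_words are made up for illustration, and the tokenization is deliberately crude: it lowercases everything and treats digits and punctuation as separators, so its output will not match tm exactly (tm drops very short words by default, for instance).

```r
# Hypothetical data frame mirroring the sample shown in the question.
df <- data.frame(
  TITLE = c(
    "My first Android app after a year",
    "Unmanned drone buzzes French police car",
    "Make anything editable with HTML5",
    "Predictive vs Reactive control",
    "What was it like to move to San Antonio and go through TechStars Cloud?",
    "Health-care sector vulnerable to hackers, researchers say"
  ),
  stringsAsFactors = FALSE
)

# count_words is an illustrative helper, not part of tm.
count_words <- function(titles) {
  titles <- as.character(titles)
  # Fail early on empty input, e.g. NULL from a misspelled column name.
  stopifnot(length(titles) > 0)
  cleaned <- tolower(titles)
  # Replace anything that is not a lowercase letter or a space with a space.
  cleaned <- gsub("[^a-z ]", " ", cleaned)
  # Split on runs of spaces and drop empty tokens.
  words <- unlist(strsplit(cleaned, " +"))
  words <- words[nzchar(words)]
  # Frequency table, most frequent word first.
  sort(table(words), decreasing = TRUE)
}

word_counts <- count_words(df$TITLE)
head(word_counts)
```

If count_words(df$TITLE) stops at the length check, the column being passed is empty or does not exist, which would be worth ruling out before looking at tm.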