Empty TermDocumentMatrix

Time: 2016-05-28 10:41:29

Tags: r twitter term-document-matrix

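Whenever I try to inspect frequent words and associations, I seem to run into a problem.

When I build the tdm, I get this output: TermDocumentMatrix

Across a large number of documents, I can see that I have plenty of terms to work with. However!

When I try to inspect the contents of the tdm, I get the following: Inspecting the TDM

It is as if the tdm has suddenly become empty?

Hope someone can help.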

#userTimeline() and twListToDF() come from the twitteR package
library(twitteR)
tweets <- userTimeline("RDataMining", n = 1000)

(n.tweet <- length(tweets))
tweets[1:3]

#convert tweets to a data frame
tweets.df <- twListToDF(tweets)
dim(tweets.df)


##Text cleaning
library(tm)
#build a corpus and specify the source to be a character vector
myCorpus <- Corpus(VectorSource(tweets.df$text))

#convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower)) 

#remove URLs
removeURL <- function(x) gsub ("http[^[:space:]]*","",x) 
myCorpus <- tm_map(myCorpus,content_transformer(removeURL))

#remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*","",x)
myCorpus <- tm_map(myCorpus,content_transformer(removeNumPunct))

#remove stopwords + 2
myStopwords <- c(stopwords('english'),"available","via")
#remove "r" and "big" from stopwords
myStopwords <- setdiff(myStopwords, c("r","big"))
#remove stopwords from corpus
myCorpus <- tm_map(myCorpus,removeWords,myStopwords)
#remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)

#keep a copy of corpus to use later as a dictionary for stem completion
myCorpusCopy <- myCorpus

#stem words
library(SnowballC)
myCorpus <- tm_map(myCorpus,stemDocument)
stemCompletion2 <- function(x,dictionary) {
x <- unlist(strsplit(as.character(x),""))

#because stemCompletion completes an empty string to a word in dict. Remove empty string to avoid this

 x <- x[x !=""]
 x <- stemCompletion(x, dictionary = dictionary)
 x <- paste (x,sep = "",collapse = "")
 PlainTextDocument(stripWhitespace(x))
}

myCorpus <- lapply(myCorpus, stemCompletion2, dictionary = myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))

#count freq of "mining"
miningCases <- lapply(myCorpusCopy,
                  function(x) {grep(as.character(x),pattern = "\\<mining")})
sum(unlist(miningCases))

#count freq of "miner"
miningCases <- lapply(myCorpusCopy,
                  function(x) {grep(as.character(x),pattern = "\\<miner")})
sum(unlist(miningCases))

#count freq of "r"
miningCases <- lapply(myCorpusCopy,
                  function(x) {grep(as.character(x),pattern = "\\<r")})
sum(unlist(miningCases))

#replace "miner" with "mining"
myCorpus <- tm_map(myCorpus,content_transformer(gsub),
               pattern = "miner", replacement = "mining")

tdm <- TermDocumentMatrix(myCorpus,
                          control = list(removePunctuation = TRUE, stopwords = TRUE))
tdm

##Freq words and associations
idx <- which(dimnames(tdm)$Terms == "r")
inspect(tdm[idx + (0:5), 101:110])

#inspect frequent words
(freq.terms <- findFreqTerms(tdm, lowfreq = 15))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq,term.freq >= 15)
df <- data.frame(term = names(term.freq), freq = term.freq)

1 answer:

Answer 0 (score: 0)

I have been testing your code with the following Twitter query:

tweets = searchTwitter("r data mining", n=10)

I think the problem is in your function stemCompletion2, which should look like this:

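stemCompletion2 <- function(x, dictionary) {
  # split the document into words (not into single characters)
  x <- unlist(strsplit(as.character(x), " "))
  print("before:")
  print(x)

  # stemCompletion completes an empty string to a word in the dictionary,
  # so remove empty strings first
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  print("after:")
  print(x)
  # join the completed words back into one space-separated string
  x <- paste(x, collapse = " ")
  PlainTextDocument(stripWhitespace(x))
}

The changes are as follows. Where you previously had

x <- unlist(strsplit(as.character(x),""))

which creates a vector containing every single character of each document, I have changed it to

x <- unlist(strsplit(as.character(x)," "))

which creates a vector of words. Likewise, when putting the document back together, where you had

x <- paste (x,sep = "",collapse = "")

which creates the long string you mentioned in your post, I have changed it to:

x <- paste(x, collapse = " ")

which joins the words back together with single spaces between them.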
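
To see why the separator matters in both places, here is a small base-R sketch on a made-up string (an assumption for illustration, not the actual tweet data). It also shows that, for a single character vector, it is `collapse` rather than `sep` that joins the elements into one string.

```r
# A cleaned, stemmed tweet (made-up example text, not real data).
doc <- "r datamin project learn data"

# Splitting on "" yields one element per character;
# splitting on " " yields one element per word.
chars <- unlist(strsplit(doc, ""))
words <- unlist(strsplit(doc, " "))
length(chars)
length(words)

# With a single vector argument, `sep` alone leaves the vector element-wise;
# `collapse` is what joins the elements into one string.
pieces <- paste(words, sep = " ")
joined <- paste(words, collapse = " ")
length(pieces)
joined
```

Comparing `length(chars)` with `length(words)` makes the original problem visible: stemCompletion was being handed single letters instead of word stems.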

Here is an example of the completion on my data:

[1] "before:"
 [1] "rt"             "ebookdealalert" "r"              "datamin"        "project"        "learn"          "data"           "mine"          
 [9] "realworld"      "project"        "book"           "solv"           "predict"        "model"         
[1] "after:"
               rt    ebookdealalert                 r           datamin           project             learn              data              mine 
             "rt" "ebookdealalerts"               "r"      "datamining"        "projects"           "learn"            "data"                "" 
        realworld           project              book              solv           predict             model 
      "realworld"        "projects"            "book"           "solve"      "predictive"        "modeling"

Once that step is done, you can use TermDocumentMatrix as expected.
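
One further thing that may be worth checking, since the question looks up the term "r": by default tm discards terms shorter than three characters when building the matrix (the `wordLengths` control option defaults to `c(3, Inf)`). The sketch below uses three made-up documents (an assumption, not the real tweets) to show the effect. Whether this also applies to your data is something to verify on your own corpus.

```r
library(tm)

# Three toy documents standing in for cleaned tweets (made-up data).
docs <- c("r data mining", "big data and r", "text mining with r")
corpus <- VCorpus(VectorSource(docs))

# Default control: terms shorter than 3 characters are dropped.
tdm_default <- TermDocumentMatrix(corpus)
"r" %in% Terms(tdm_default)

# Allowing 1-character terms keeps "r" in the matrix.
tdm_short <- TermDocumentMatrix(corpus,
                                control = list(wordLengths = c(1, Inf)))
"r" %in% Terms(tdm_short)
```

If "r" is missing from the terms, `which(dimnames(tdm)$Terms == "r")` returns an empty index, and inspecting rows based on that index shows an empty matrix.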

Hope it helps.