R: dynamic stop-word list based on term frequency

Asked: 2015-07-06 18:00:25

Tags: r text-mining tm stop-words

I am working on a text-mining task and am currently stuck. The following is based on Zhao's Text Mining with Twitter. I can't get it to work; maybe one of you has a good idea?

Goal: I want to remove all words from the corpus whose count is 1, instead of using a stop-word list.

What I have done so far: I downloaded the tweets and converted them into a data frame.

tf1 <- Corpus(VectorSource(tweets.df$text))


tf1 <- tm_map(tf1, content_transformer(tolower))


removeUser <- function(x) gsub("@[[:alnum:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeUser))


removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeNumPunct))


removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeURL))

tf1 <- tm_map(tf1, stripWhitespace)


# Using a TermDocumentMatrix in order to find terms with count 1; don't know any other way
tdmtf1 <- TermDocumentMatrix(tf1, control = list(wordLengths = c(1, Inf)))

ones <- findFreqTerms(tdmtf1, lowfreq = 1, highfreq = 1)

tf1Copy <- tf1

tf1List <- setdiff(tf1Copy, ones)


tf1CList <- paste(unlist(tf1List),sep="", collapse=" ")

tf1Copy <- tm_map(tf1Copy, removeWords, tf1CList)

tdmtf1Test <- TermDocumentMatrix(tf1Copy, control = list(wordLengths = c(1, Inf)))

# Just to test for success...
ones2 <- findFreqTerms(tdmtf1Test, lowfreq = 1, highfreq = 1)
(ones2)

Error:


Error in gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), : invalid regular expression '(*UCP)\b(senior data scientist global strategy company 25.0010230541229 48 17 6 6 115 1 186 0 1 zh kdnuggets poll primary programming language for analytics data mining data science
  25.0020229816437 48 17 6 6 115 1 186 0 2 en iapa canberra seminar mining the internet for official statistics in the information age anu june 25.0020229816437 48 17 6 6 115 1 186 0 3 handling and processing strings in r ebook in pdf format pages
  25.0020229816437 48 17 6 6 115 1 186 0 4 webinar getting your data into r by hadley wickham on june 6 25.0020229816437 48 17 6 6 115 1 186 0 5 before loading the rdmtweets dataset please run librarytwitter to load the required package
  25.0020229816437 48 17 6 6 115 1 186 0 6 infographic on sas vs r vs python datascience 25.0020229816437 48 17 6 6 115 1 186 0 7 kdnuggets poll on top analytics data mining science software again 25.0020229816437 48 17 6 6 115 1 186 0 8 i will run

In addition:


Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), : PCRE pattern compilation error
        'regular expression is too large'
        at ''

PS: Sorry about the bad formatting at the end; I couldn't fix it.

2 answers:

Answer 0 (score: 1):

Here is how to remove from the corpus all words with a count of 1:

library(tm)
mytweets <- c("This is a doc", "This is another doc")

corp <- Corpus(VectorSource(mytweets))
inspect(corp)
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# This is a doc
# 
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
#   This is another doc
##            ^^^ 

dtm <- DocumentTermMatrix(corp)
inspect(dtm)
# Terms
# Docs another doc this
# 1       0   1    1
# 2       1   1    1

(stopwords <- findFreqTerms(dtm, 1, 1))
# [1] "another"

corp <- tm_map(corp, removeWords, stopwords)
inspect(corp)
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# This is a doc
# 
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
# This is  doc
##        ^ 'another' is gone

(As a side note: the token 'a' from 'This is a doc' is also gone, because DocumentTermMatrix removes tokens shorter than 3 characters by default.)
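(Another side note, addressing the error shown in the question: removeWords builds a single regular expression out of all the supplied words, so a very large stop-word vector can trigger PCRE's "regular expression is too large" error. One possible workaround, sketched here under the assumption that `ones` is the character vector of singleton terms from findFreqTerms, is to apply removeWords in chunks:)

```r
library(tm)

# Sketch: remove a large stop-word vector in chunks, so that each
# gsub() call inside removeWords stays under PCRE's pattern-size limit.
# Assumes `tf1` is the tm Corpus and `ones` the singleton terms.
chunk_size <- 500  # arbitrary; shrink it if the error persists
chunks <- split(ones, ceiling(seq_along(ones) / chunk_size))
for (ch in chunks) {
  tf1 <- tm_map(tf1, removeWords, ch)
}
```

Note also that the question passes paste(..., collapse = " ") to removeWords, i.e. one giant string rather than a character vector of words, which is a separate problem from the pattern-size limit.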

Answer 1 (score: 0):

Here is a simpler way, using the dfm() and trim() functions from the quanteda package:

require(quanteda)

mydfm <- dfm(c("This is a doc", "This is another doc"), verbose = FALSE)
mydfm
## Document-feature matrix of: 2 documents, 5 features.
## 2 x 5 sparse Matrix of class "dfmSparse"
## features
## docs    a another doc is this
## text1 1       0   1  1    1
## text2 0       1   1  1    1

trim(mydfm, minCount = 2)
## Features occurring less than 2 times: 2 
## Document-feature matrix of: 2 documents, 3 features.
## 2 x 3 sparse Matrix of class "dfmSparse"
## features
## docs    doc is this
## text1   1  1    1
## text2   1  1    1
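(Note for readers on newer quanteda releases: trim() has since been renamed. Assuming a quanteda version where dfm_trim() replaced trim(), the equivalent call would look roughly like this:)

```r
library(quanteda)

# Same trimming with the newer quanteda API, where trim() became
# dfm_trim() and the minCount argument became min_termfreq
# (version-dependent -- check ?dfm_trim on your installation).
mydfm <- dfm(tokens(c("This is a doc", "This is another doc")))
dfm_trim(mydfm, min_termfreq = 2)
```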