Removing stop words with the tm package (gsub error)

Time: 2016-07-20 17:52:28

Tags: r gsub tm stop-words

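I am trying to remove a list of stop words that I created from my corpus. I am not sure what is going on, because I removed all special characters from the stop word list and cleaned the text in the corpus. Any help would be appreciated. The code and error messages are below. The CSV with the user-defined stop words is listed here: Stop Words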

    library(tm)       # Corpus, VectorSource, tm_map, removeWords
    library(stringr)  # str_replace_all

    myCorpus <- Corpus(VectorSource(c("blank", "blank", "blank", "blank", "blank", "blank", "blank", 
"blank", "blank", "blank", "blank", "blank", "blank", "<br />Key skills:<br />Octopus Deploy, MS Build, PowerShell, Azure, NuGet, CI / CD concepts, release management<br /><br /> * Minimum 5 years plus relevant experience in Application Development lifecycle, Automation and Release and Configuration Management<br /> * Considerable experience in the following disciplines - TFS (Team Foundation Server), DevOps, Continuous Delivery, Release Engineering, Application Architect, Database Architect, Information Modeling, Service Oriented Architecture (SOA), Quality Assurance, Branch Management, Network setup and troubleshooting, Server setup, configuration, maintenance and patching<br /> * Solid understanding of Software Development Life Cycle, Test Driven Development, Continuous Integration and Continuous Delivery<br /> * Solid understanding and experience working with high availability and high performance, multi-data center systems and hybrid cloud environments.<br /> * Proficient with Agile methodologies and working closely within small teams and vendors<br /> * Knowledge of Deployment and configuration automation platforms<br /> * Extensive PowerShell experience<br /> * Extensive knowledge of Windows based systems including hardware, software and .NET applications<br /> * Strong ability to troubleshoot complex issues ranging from system resources to application stack traces<br /><br />REQUIRED SKILLS:<br />Bachelor's degree & 5-10 years of relevant work experience.", 
    "blank")))

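# Strip HTML tags, very long tokens, and punctuation from each document
for (j in seq(myCorpus)) {
  # "<[^>]*>" matches one tag at a time; the greedy "<.*>" would delete
  # everything between the first "<" and the last ">" in a document
  myCorpus[[j]] <- gsub("<[^>]*>", " ", myCorpus[[j]])
  myCorpus[[j]] <- gsub("\\b[[:alnum:]]{20,}\\b", " ", myCorpus[[j]], perl = TRUE)
  myCorpus[[j]] <- gsub("[[:punct:]]", " ", myCorpus[[j]])
}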

#Clean Corpus
myCorpus <- tm_map(myCorpus, PlainTextDocument)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, stripWhitespace)

#User defined stop word
manualStopwords <- read.csv("r_stop.csv", header = TRUE)
myStopwords <- paste(manualStopwords[,1])
myStopwords <- str_replace_all(myStopwords, "[[:punct:]]", "")
myStopwords <- gsub("\\+", "plus", myStopwords)
myStopwords <- gsub("\\$", "dollars", myStopwords)

myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

First error

Error in gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : invalid regular expression '(*UCP)\b(zimmermann|yrs|yr|youve|.....the rest of the stop words

Additional error

In addition: Warning message: In gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : PCRE pattern compilation error 'regular expression is too large' at ''
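
For context, the call shown in the error message suggests that removeWords joins every stop word into one alternation and hands it to gsub as a single Perl-style pattern. The base-R sketch below reproduces that construction on a tiny, made-up word list (not the asker's CSV); the `collapse = "|"` argument is inferred from the `|` separators visible in the error text rather than copied from it.

```r
# Small illustrative word list; the real list comes from r_stop.csv.
words <- c("yr", "yrs", "youve", "zimmermann")

# Sort in decreasing order, join with "|", and wrap in word boundaries,
# mirroring the sprintf() call that appears in the error message.
pattern <- sprintf("(*UCP)\\b(%s)\\b",
                   paste(sort(words, decreasing = TRUE), collapse = "|"))
print(pattern)

# Every extra stop word makes this one pattern longer.
print(nchar(pattern))

# With a short list the pattern compiles and removes the listed words.
cleaned <- gsub(pattern, "", "five yrs of experience", perl = TRUE)
print(cleaned)
```

Because the whole list ends up in one pattern, a long enough list can exceed what PCRE is willing to compile, which is consistent with the 'regular expression is too large' warning.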

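1 answer:

Answer 0 (score: 1)

I was able to get the code to run by splitting my stop words into smaller buckets, so that each call to removeWords builds a shorter regular expression. Memory may also be an issue with the full list.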

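# Split the stop words into buckets of at most 500 words each
chunk <- 500
n <- length(myStopwords)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]  # bucket index for every word
d <- split(myStopwords, r)

# Remove one bucket at a time so that no single pattern holds the full list
for (i in seq_along(d)) {
  myCorpus <- tm_map(myCorpus, removeWords, d[[i]])
}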
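
The same bucketing idea can be tried without tm. The following is a minimal base-R sketch, not the tm implementation: `remove_words_chunked`, `docs`, and `stops` are hypothetical names introduced here, and it assumes the stop words contain no regex metacharacters (the asker had already stripped punctuation from the list).

```r
# Hypothetical helper: remove `words` from `text` in buckets of `chunk_size`,
# so that no single regular expression contains the whole list.
remove_words_chunked <- function(text, words, chunk_size = 500) {
  # Bucket index per word: 1, 1, ..., 2, 2, ...
  buckets <- split(words, ceiling(seq_along(words) / chunk_size))
  for (bucket in buckets) {
    # One alternation per bucket, wrapped in word boundaries
    pattern <- sprintf("\\b(%s)\\b",
                       paste(sort(bucket, decreasing = TRUE), collapse = "|"))
    text <- gsub(pattern, "", text, perl = TRUE)
  }
  text
}

# Toy data; chunk_size = 2 forces two buckets for three stop words.
docs <- c("five yrs of relevant experience", "blank")
stops <- c("yrs", "of", "blank")
result <- remove_words_chunked(docs, stops, chunk_size = 2)
print(result)
```

Removed words leave their surrounding spaces behind, so a whitespace clean-up such as stripWhitespace is still worth running afterwards. The bucket size of 500 is simply the value used in the answer; a smaller value trades more gsub passes for shorter patterns.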