Speeding up text mining (and loops) in R

Time: 2016-02-14 18:50:33

Tags: r for-loop text frequency text-mining

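I am text mining several thousand documents (essentially doing frequency counts) and would like to know whether there are other ways to speed up the following process. At present, the whole analysis takes more than 10 hours to run. Thanks (from an R beginner).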

sessionInfo() 
#R version 3.2.3 (2015-12-10)

library(bitops)
library(RCurl)
library(XML)
library(stringr)
library(tm)

setwd("F:/testing_folder")
path = "F:/testing_folder" 

file.names <- dir(path, pattern =".txt") 
filename <- vector()
totalword <- vector()

system.time(
  for(i in 1:length(file.names)){
    text.v <- scan(file.names[i], what="character", sep="\n",encoding = "UTF-8") 
    report.v <- paste(text.v, collapse=" " ) 

    #Count total number of words
    words.l <- strsplit(report.v, "\\W") 
    word.v <- unlist(words.l) 
    not.blanks.v <- which(word.v!="") 
    word.v <- word.v[not.blanks.v] 
    totalword <- append(totalword,length(word.v)) 

    filename <- append(filename,print(file.names[i])) 
    x <- data.frame(filename,totalword)  
    write.csv(x, file= "results.csv") #export results
  }
)

1 answer:

Answer 0 (score: 0)

What output do you get from profiling your script with the following?

Rprof("profile1.out", line.profiling=TRUE)
source("http://pastebin.com/raw/kFGCse5s")
Rprof(NULL)
proftable("profile1.out", lines=10)
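Profiling should show where the time actually goes. One plausible hot spot, which the profile would confirm or rule out, is that the loop in the question rewrites results.csv and grows both vectors with append() on every iteration. Below is a minimal sketch of a restructured version, not a measured fix: count_words and count_all are hypothetical helper names, readLines stands in for the original scan call, and the demo files are made up so that the snippet runs on its own.

```r
# Hypothetical helper (not from the question): count the words in one file.
count_words <- function(file) {
  # Read the file line by line, roughly what scan(..., sep = "\n") did
  lines <- readLines(file, encoding = "UTF-8", warn = FALSE)
  # Split on non-word characters, as in the question, and drop empty strings
  words <- unlist(strsplit(paste(lines, collapse = " "), "\\W"))
  sum(words != "")
}

# Hypothetical wrapper: count words for every .txt file under `path`.
count_all <- function(path) {
  # "\\.txt$" matches only names ending in .txt
  files <- list.files(path, pattern = "\\.txt$", full.names = TRUE)
  # vapply allocates the result once instead of growing it with append()
  totals <- vapply(files, count_words, integer(1), USE.NAMES = FALSE)
  data.frame(filename = basename(files), totalword = totals,
             stringsAsFactors = FALSE)
}

# Self-contained demo on temporary files (made-up contents)
demo_dir <- tempfile("textmining_")
dir.create(demo_dir)
writeLines(c("one two three", "four, five!"), file.path(demo_dir, "a.txt"))
writeLines("hello world", file.path(demo_dir, "b.txt"))

results <- count_all(demo_dir)

# Write the CSV once, after all files are processed
write.csv(results, file.path(demo_dir, "results.csv"), row.names = FALSE)
```

For the real data, calling count_all("F:/testing_folder") would replace the demo directory. Whether this accounts for most of the 10 hours depends on what the profile shows. If reading the files dominates, restructuring the loop will help much less.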