Question

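I am doing text mining in R with the tm and Snowball packages. I first ran it on a Windows 7 laptop with 8 GB of RAM. Later I tried the same operations on a Linux (Ubuntu) machine with 64 GB of RAM. Both machines are 64-bit, and I use the 64-bit version of R on both. However, Windows has R 3.0.0 while Linux has R 2.14.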

Some commands are very slow on Linux compared with Windows.

Corpus command

On Windows

    d <- data.frame(chatTranscripts$chatConcat)
    ds <- DataframeSource(d)
    t1 <- Sys.time()
    dsc<-Corpus(ds)
    print(Sys.time() - t1)
    Time difference of 46.86169 secs

It took only 47 seconds on the Windows machine.

On Linux

    t1 <- Sys.time()
    dsc<-Corpus(ds)
    print(Sys.time() - t1)
    Time difference of 3.674376 mins

This took 220 seconds on the Linux machine.

Snowball Stemming

On Windows

    t1 <- Sys.time()
    dsc <- tm_map(dsc,stemDocument)
    print(Sys.time() - t1)
    Time difference of 12.05321 secs

It took only 12 seconds on the Windows machine.

On Linux

    t1 <- Sys.time()
    dsc <- tm_map(dsc,stemDocument)
    print(Sys.time() - t1)
    Time difference of 4.832964 mins

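This took 290 seconds on the Linux machine.

Is there a way to speed up these commands on the Linux machine? Could the R version make such a big difference? Thanks.

Ravi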

Answer 1

Corpus() on a VectorSource() appears to be faster than Corpus() on a DataframeSource().

You could try

    d <- chatTranscripts$chatConcat
    ds <- VectorSource(d)
    Corpus(ds)
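
To check whether this helps, the same script can be run on both machines. The following is only a minimal sketch, assuming the tm package is installed. `texts` is a made-up stand-in for `chatTranscripts$chatConcat`, and `VCorpus()` is used in place of `Corpus()` only so that the resulting class is the same across tm versions. It also prints the R and tm versions, since the two machines in the question differ there.

```r
library(tm)

# Hypothetical stand-in for chatTranscripts$chatConcat
texts <- c("the customer was asking about billing",
           "agents were resolving connection problems")

# Record versions so timings from two machines can be compared fairly
cat(R.version.string, "\n")
cat("tm", as.character(packageVersion("tm")), "\n")

# system.time() reports elapsed wall-clock seconds directly,
# so there is no need to subtract two Sys.time() values
build_time <- system.time(
  dsc <- VCorpus(VectorSource(texts))
)[["elapsed"]]

cat("Corpus build:", build_time, "secs\n")
```

Wrapping the `tm_map(dsc, stemDocument)` call in `system.time()` the same way would give a comparable number for the stemming step.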
