Millions of small matches in R: need for performance

Date: 2015-10-04 11:00:11

Tags: r performance join position match

I have a vector of one million words called WORDS, and a list of 9 million objects called SENTENCES. Each object in the list is a sentence, represented as a vector of 10-50 words. Here is an example:

head(WORDS)
[1] "aba" "accra" "ada" "afrika" "afrikan" "afula" "aggamemon"

SENTENCES[[1]]
[1] "how" "to" "interpret" "that" "picture"

I want to convert every sentence in the list into a numeric vector whose elements correspond to the positions of the sentence's words in the big WORDS vector. In fact, I already know how to do that with this command:

convert <- function(sentence){
  return(which(WORDS %in% sentence))
}

SENTENCES_NUM <- lapply(SENTENCES, convert)

The problem is that it takes far too long. I mean, my RStudio blows up, even though my machine has 16 GB of RAM. So the question is: do you have any ideas to speed up this computation?

2 answers:

Answer 0 (score: 3):

fastmatch is a small package by a member of the R core team; it hashes the lookup table so that the initial search, and especially subsequent searches, are faster.
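As a minimal sketch of what that caching buys you (words_demo is a made-up toy vector, not the poster's data):

library(fastmatch)

words_demo <- c("aba", "accra", "ada", "afrika")
# the first fmatch() call builds a hash table on words_demo and caches it
# as an attribute of the vector, so later lookups against it reuse the hash
fmatch(c("ada", "aba"), words_demo)   # builds the hash; returns 3 1
fmatch("afrika", words_demo)          # reuses the cached hash; returns 4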

What you are really doing is building a factor with predefined levels common to every sentence. The slow step in its underlying C code is sorting the factor levels, which you can avoid by supplying the (unique) list of factor levels to a fast version of the factor function (sketched below).
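A hedged sketch of that idea: fast_factor below is a hypothetical helper (not an exported fastmatch function) that takes the levels as given, already unique, and builds the integer codes with fmatch instead of sorting:

library(fastmatch)

# hypothetical fast_factor(): skip the level sort by trusting the supplied
# unique levels, and hash the lookups with fmatch()
fast_factor <- function(x, levels) {
  f <- fmatch(x, levels)              # integer codes, NA for unseen words
  levels(f) <- as.character(levels)
  class(f) <- "factor"
  f
}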

If you just want the integer positions, you can easily convert from factor to integer; in fact, many people do this inadvertently.
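For example, with toy vectors:

# the integer codes of a factor are exactly the positions of its
# values within the levels vector
f <- factor(c("to", "how", "to"), levels = c("how", "interpret", "to"))
as.integer(f)   # 3 1 3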

You don't actually need a factor at all, just match. Note that your code also generates a logical vector and then recomputes the positions from it, whereas match goes straight to the positions.
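A tiny made-up example makes the difference concrete: which(WORDS %in% sentence) returns the sorted unique positions, losing word order and repeats, while match returns one position per word, in sentence order:

words_demo <- c("how", "interpret", "picture", "that", "to")
s <- c("how", "to", "interpret", "that", "picture")
which(words_demo %in% s)   # 1 2 3 4 5 -- sorted unique positions only
match(s, words_demo)       # 1 5 2 4 3 -- one position per word, in order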

library(fastmatch)
library(microbenchmark)

WORDS <- read.table("https://dotnetperls-controls.googlecode.com/files/enable1.txt", stringsAsFactors = FALSE)[[1]]

words_factor <- as.factor(WORDS)

# generate 100 sentences of between 5 and 15 words:
SENTENCES <- lapply(1:100, function(i) sample(WORDS, size = sample(5:15, size = 1)))

bench_fun <- function(fun)
  lapply(SENTENCES, fun)

# poster's slow solution:
hg_convert <- function(sentence)
  return(which(WORDS %in% sentence))

jw_convert_match <- function(sentence) 
  match(sentence, WORDS)

jw_convert_match_factor <- function(sentence) 
  match(sentence, words_factor)

jw_convert_fastmatch <- function(sentence) 
  fmatch(sentence, WORDS)

jw_convert_fastmatch_factor <- function(sentence)
  fmatch(sentence, words_factor)

message("starting benchmark one")
print(microbenchmark(bench_fun(hg_convert),
                     bench_fun(jw_convert_match),
                     bench_fun(jw_convert_match_factor),
                     bench_fun(jw_convert_fastmatch),
                     bench_fun(jw_convert_fastmatch_factor),
                     times = 10))

# now again with big samples
# generating the SENTENCES is quite slow...
SENTENCES <- lapply(1:1e6, function(i) sample(WORDS, size = sample(5:15, size = 1)))
message("starting benchmark two, compare with factor vs vector of words")
print(microbenchmark(bench_fun(jw_convert_fastmatch),
                     bench_fun(jw_convert_fastmatch_factor),
                     times = 10))

I have put this in a gist at https://gist.github.com/jackwasey/59848d84728c0f55ef11

The results aren't formatted very nicely; suffice it to say that fastmatch, with or without factor input, is dramatically faster.

# starting benchmark one
Unit: microseconds
                                   expr         min          lq         mean      median          uq         max neval
                  bench_fun(hg_convert)  665167.953  678451.008  704030.2427  691859.576  738071.699  777176.143    10
            bench_fun(jw_convert_match)  878269.025  950580.480  962171.6683  956413.486  990592.691 1014922.639    10
     bench_fun(jw_convert_match_factor) 1082116.859 1104331.677 1182310.1228 1184336.810 1198233.436 1436600.764    10
        bench_fun(jw_convert_fastmatch)     203.031     220.134     462.1246     289.647     305.070    2196.906    10
 bench_fun(jw_convert_fastmatch_factor)     251.474     300.729    1351.6974     317.439     362.127   10604.506    10

# starting benchmark two, compare with factor vs vector of words
Unit: seconds
                                   expr      min       lq     mean   median       uq      max neval
        bench_fun(jw_convert_fastmatch) 3.066001 3.134702 3.186347 3.177419 3.212144 3.351648    10
 bench_fun(jw_convert_fastmatch_factor) 3.012734 3.149879 3.281194 3.250365 3.498593 3.563907    10

So for now, I wouldn't bother with a parallel implementation.

Answer 1 (score: -1):

It won't be any faster, but it is a tidy way of handling things.

library(dplyr)
library(tidyr)

# one row per word, keyed by the sentence it came from
sentence = 
  data_frame(word.name = SENTENCES,
             sentence.ID = 1:length(SENTENCES)) %>%
  unnest(word.name)

# lookup table of words and their positions in WORDS
word = data_frame(
  word.name = WORDS,
  word.ID = 1:length(WORDS))

# join to attach each word's position to each sentence word
sentence__word = 
  sentence %>%
  left_join(word)
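If you then want one numeric vector per sentence, as in the question, one possible follow-up (a sketch, assuming the join above has run) is to split the word IDs back out by sentence:

# regroup the joined rows into one integer vector of positions per sentence
SENTENCES_NUM <- split(sentence__word$word.ID, sentence__word$sentence.ID)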