Improve the performance of computing the sum of word scores over a large string vector?

Date: 2017-04-22 23:45:21

Tags: r string performance loops vectorization

I have a vector of strings such as:

 [1] "What can we learn from the Mahabharata "
 [2] "What are the most iconic songs associated with the Vietnam War "
 [3] "What are some major social faux pas to avoid when visiting Malta "
 [4] "Will Ready Boost technology contribute to CFD software usage "
 [5] "Who is Jon Snow " ...

and a data frame assigning a score to each word:

   word score
    the    11
     to     9
   What     9
      I     7
      a     6
    are     6

I want to assign to each string the sum of the scores of the words it contains; my solution is the function below:

 score_fun <- function(x) {
   # obtaining the list of words
   z <- unlist(strsplit(x, ' '))
   # returning the sum of the words' scores
   return(sum(word_scores$score[word_scores$word %in% z]))
 }

 # using sapply() in conjunction with the function
 scores <- sapply(my_strings, score_fun, USE.NAMES = FALSE)

 # the output will look like
 scores
 [1] 20 26 24  9  0  0 38 32 30  0

The problem I have is one of performance: with roughly 500k strings containing over a million words in total, running this function takes more than an hour on my i7, 16 GB machine. Besides, the solution just feels inelegant and clunky.

Is there a better (more efficient) solution?

Data to reproduce:
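A minimal construction of my_strings and word_scores along the lines of the snippets above might look like the following. Only the five strings and six word scores quoted in the question are used, so this sketch will not reproduce the exact scores output shown earlier, which was computed on the full data.

 # Placeholder reproduction: only the values quoted above are real; the full
 # data (10 strings, a larger word_scores table) is not shown in the post.
 my_strings <- c(
   "What can we learn from the Mahabharata ",
   "What are the most iconic songs associated with the Vietnam War ",
   "What are some major social faux pas to avoid when visiting Malta ",
   "Will Ready Boost technology contribute to CFD software usage ",
   "Who is Jon Snow "
 )

 word_scores <- data.frame(
   word  = c("the", "to", "What", "I", "a", "are"),
   score = c(11, 9, 9, 7, 6, 6),
   stringsAsFactors = FALSE
 )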

2 Answers:

Answer 0 (score: 3)

You can use tidytext::unnest_tokens to tokenize into words, then join and summarise:

library(tidyverse)
library(tidytext)

data_frame(string = my_strings, id = seq_along(string)) %>% 
    unnest_tokens(word, string, 'words', to_lower = FALSE) %>%  # one row per word per string
    distinct() %>%                                              # keep each word once per string
    left_join(word_scores) %>%                                  # joins by "word"
    group_by(id) %>%
    summarise(score = sum(score, na.rm = TRUE))                 # unscored words count as 0

#> # A tibble: 10 × 2
#>       id score
#>    <int> <int>
#> 1      1    20
#> 2      2    26
#> 3      3    24
#> 4      4     9
#> 5      5     0
#> 6      6     0
#> 7      7    38
#> 8      8    32
#> 9      9    30
#> 10    10     0

Keep the original strings in that data frame if you like, or rejoin them by id at the end.
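For example, a minimal sketch of rejoining by id, assuming the summarised result above has been saved as result (a name not used in the original answer):

result %>% 
    left_join(data_frame(id = seq_along(my_strings), string = my_strings), by = "id")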

It's a lot slower on small data, but gets quicker at scale, e.g. when my_strings is resampled up to a length of 10,000 (a sketch of how this comparison might be set up follows the timings):

Unit: milliseconds
     expr        min         lq      mean    median        uq       max neval
   Reduce 5440.03300 5656.41350 5815.2094 5814.0406 5944.9969 6206.2502   100
   sapply  460.75930  486.94336  511.2762  503.4932  532.2363  746.8376   100
 tidytext   86.92182   94.65745  101.7064  100.1487  107.3289  134.7276   100
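A sketch of such a benchmark, assuming score_fun, word_scores and the tidytext pipeline are defined as above; big_strings is an illustrative name, and the exact expressions timed in the original answer are not shown (the Reduce row corresponds to the approach in the next answer):

library(microbenchmark)

# resample the example strings up to 10,000 elements (illustrative)
big_strings <- sample(my_strings, 10000, replace = TRUE)

microbenchmark(
  sapply = sapply(big_strings, score_fun, USE.NAMES = FALSE),
  tidytext = {
    data_frame(string = big_strings, id = seq_along(string)) %>%
      unnest_tokens(word, string, 'words', to_lower = FALSE) %>%
      distinct() %>%
      left_join(word_scores, by = "word") %>%
      group_by(id) %>%
      summarise(score = sum(score, na.rm = TRUE))
  },
  times = 100
)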

Answer 1 (score: 2)

Consider building a data frame of the split-out words, then merging it with word_scores, and finally aggregating the scores by phrase id. This approach avoids the element-by-element sapply loop.

# split each string into its unique words
list_strings <- lapply(my_strings, function(i) unique(unlist(strsplit(i, split=" "))))

# phrase id repeated once per word in that phrase
ids <- lapply(seq(length(list_strings)), function(i) rep(i, length(list_strings[[i]])))

# long (id, word) data frame, merged with the scores and aggregated by id
phrases_df <- data.frame(id=Reduce(append, ids), word=Reduce(append, list_strings))
aggdf <- aggregate(score~id, merge(phrases_df, word_scores, by="word"), FUN=sum)

aggdf 
#   id score
# 1  1    20
# 2  2    26
# 3  3    24
# 4  4     9
# 5  7    38
# 6  8    32
# 7  9    30

Although microbenchmark shows this approach is slower on the posted data (1 millisecond = 1,000 microseconds), it may scale better than sapply as the data grows.

library(microbenchmark)

microbenchmark({
   list_strings <- lapply(my_strings, function(i) unique(unlist(strsplit(i, split=" "))))

   ids <- lapply(seq(length(list_strings)), function(i) rep(i, length(list_strings[[i]])))

   phrases_df <- data.frame(id=Reduce(append, ids), word=Reduce(append, list_strings))      
   aggdf <- aggregate(score~id, merge(phrases_df, word_scores, by="word"), FUN=sum)

})

# Unit: milliseconds
#      min       lq     mean   median       uq      max neval
# 5.623328 5.808831 6.177336 5.964018 6.252019 10.09706   100

microbenchmark({
  score_fun<- function(x) {
     z <- unlist(strsplit(x,' '))
     return(sum(word_scores$score[word_scores$word %in% z]))
  } 
  scores <- sapply(my_strings, score_fun, USE.NAMES = F)
})

# Unit: microseconds
#       min      lq     mean  median       uq     max neval
# 809.382 843.307 1005.366 865.442 1209.983 1873.32   100