How can I make this loop run faster in R?

Time: 2016-04-21 08:20:03

Tags: r performance loops

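I have a dictionary data frame, words.dict, with about 44,000 words. The code below is meant to replace every word in the dataset dataset.num with its numeric ID from the dictionary.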

dataset.num:

dput(head(dataset.num))
c("rt   breaking  will from here forward be know as", "i hope you like wine and cocktails", "this week we are upgrading our servers  there may be periodic disruptions to the housing application portal  sorry for any inconvenience", "hanging out in  foiachat  anyone have fav  management software on the gov t side  anything from intake to redaction   onwards", "they left out kourtney  instead they let chick from big bang talk", "i  am  encoding  film   like  for the  billionth time already ")

words.dict:

dput(head(words.dict, 20))
structure(list(id = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), word = structure(1:20, .Label =c("already", "am", "and", "any", "anyone", "anything", "application", "are", "as", "bang", "be", "big", "billionth", "breaking", "chick", "cocktails","disruptions", "encoding", "fav", "film", "foiachat", "for", "forward", "from", "gov", "hanging", "have", "here", "hope", "housing", "i", "in", "inconvenience", "instead", "intake", "know", "kourtney", "left", "let", "like", "management", "may", "on", "onwards", "our", "out", "periodic", "portal", "redaction", "rt", "servers", "side", "software", "sorry", "t", "talk", "the", "there", "they", "this", "time", "to", "upgrading", "we", "week", "will", "wine", "you"), class = "factor")), .Names = c("id", "word"), row.names = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), class = "data.frame")

The loop:

for (i in seq_len(nrow(words.dict))) {
    # Replace whole-word occurrences of the i-th dictionary word with its numeric ID
    dataset.num <- gsub(paste0("\\b(", words.dict[i, "word"], ")\\b"), words.dict[i, 1], dataset.num)
}

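I have truncated the data here, but dataset.num is a character vector of nearly 40,000 elements, each containing about 20 words on average. The code works on small data, but it is too slow on the full data set given my limited processing power.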

What would you suggest to improve the efficiency and performance of this code?

1 Answer:

Answer 0: (score: 1)

Here is a different approach that may scale better, although I haven't really tested it.

sapply(strsplit(dataset.num, "\\s+"), function(y) {
  i <- match(y, words.dict$word)
  y[!is.na(i)] <- words.dict$id[na.omit(i)]
  paste(y, collapse = " ")
})
#[1] "rt 22 will from here forward 3 know 18"                                                                           
#[2] "i hope you like wine 12 24"                                                                                       
#[3] "this week we 17 upgrading our servers there may 3 periodic 25 to the housing 16 portal sorry for 13 inconvenience"
#[4] "hanging out in foiachat 14 have 27 management software on the gov t side 15 from intake to redaction onwards"     
#[5] "they left out kourtney instead they let 23 from 20 19 talk"                                                       
#[6] "i 11 26 28 like for the 21 time 10"
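
A possible further variation on the same idea, not part of the original answer and not benchmarked on the full data: flatten all tokens into one vector so that match() is called once in total rather than once per line, then regroup the tokens by line. The small words.dict and dataset.num below are toy stand-ins built from the sample values in the question, and the names flat, idx, hit, and line_id are made up for this sketch.

```r
# Toy stand-ins for the question's objects (values taken from the sample data)
words.dict <- data.frame(
  id   = c(10L, 11L, 12L, 18L, 3L, 22L, 24L),
  word = c("already", "am", "and", "as", "be", "breaking", "cocktails"),
  stringsAsFactors = FALSE
)
dataset.num <- c("rt   breaking  will from here forward be know as",
                 "i hope you like wine and cocktails")

# Split every line into tokens once
tokens <- strsplit(dataset.num, "\\s+")

# Flatten, then look up all tokens with a single match() call
flat <- unlist(tokens)
idx  <- match(flat, as.character(words.dict$word))
hit  <- !is.na(idx)
flat[hit] <- words.dict$id[idx[hit]]  # IDs are coerced to character here

# Regroup the tokens by their original line and paste them back together;
# explicit factor levels keep lines that produced no tokens
line_id <- factor(rep(seq_along(tokens), lengths(tokens)),
                  levels = seq_along(tokens))
result <- unname(vapply(split(flat, line_id), paste, character(1), collapse = " "))
result
#[1] "rt 22 will from here forward 3 know 18"
#[2] "i hope you like wine 12 24"
```

Whether this is actually faster than the sapply version depends on the data, so it is worth timing both on a sample first.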

Note that you can use stringi::stri_split to speed up the string splitting.
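
As a rough illustration of that suggestion, the splitting step could be wrapped in a small helper that uses stringi when it is installed and falls back to base R otherwise. The helper name split_tokens is hypothetical. stri_split_regex is the regex variant of stri_split, and its omit_empty = TRUE argument drops the empty tokens that leading or trailing whitespace would otherwise produce.

```r
# Hypothetical helper: split on runs of whitespace, dropping empty tokens
split_tokens <- function(x) {
  if (requireNamespace("stringi", quietly = TRUE)) {
    stringi::stri_split_regex(x, "\\s+", omit_empty = TRUE)
  } else {
    # Base R fallback; remove the "" that leading whitespace produces
    lapply(strsplit(x, "\\s+"), function(tok) tok[nzchar(tok)])
  }
}

split_tokens("i  am  encoding ")[[1]]
#[1] "i"        "am"       "encoding"
```

The result can then replace strsplit(dataset.num, "\\s+") in the code above.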