Question

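I am trying to do word matching on a large dataset, and I would like to know whether there is a way to speed up the slowest operation in my workflow.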

My goal is to find the positions where a dictionary of words matches the entries in a list of word vectors.

words <- c("cat", "dog", "snake", "cow")
scores <- c(1.5, 0.7, 3.5, 4.6)
dic <- data.frame(words, scores)

wordList <- list(c("jiraffe", "dog"), c("cat", "elephant"), c("snake", "cow"))

The fastest approach I have found so far is this:

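matches <- function(wordList) {
    # wordList here is a single character vector (one element of the list)
    subD <- which(dic$words %in% wordList)
    subD  # return the matching row indices of dic
}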

The output I want is:

matches(wordList):
list(c(2), c(1), c(3, 4))
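
For reference, the output above corresponds to applying the lookup to each element of wordList in turn rather than to the whole list at once. A minimal sketch, assuming the sample data from this question (the name `idx` is introduced here for illustration only):

```r
words <- c("cat", "dog", "snake", "cow")
scores <- c(1.5, 0.7, 3.5, 4.6)
dic <- data.frame(words, scores)
wordList <- list(c("jiraffe", "dog"), c("cat", "elephant"), c("snake", "cow"))

# Look up each character vector of wordList separately;
# which() gives the row indices of dic whose word is present in that vector
idx <- lapply(wordList, function(w) which(dic$words %in% w))
```

Words that are not in the dictionary (here "jiraffe" and "elephant") simply contribute no index.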

I can use that later to get the average score for each element of wordList:

averageScore <- sapply(lapply(wordList, matches), function(x) {mean(dic[x, "scores"])})

Is there a faster way to do the string matching than what I am doing in the function:

subD <- which(dic$words %in% wordList)

I tried a dplyr approach, thinking it might be faster: first using "filter" to get a subset of "dic" and then applying "colMeans" to it, but it appears to be about twice as slow.

Also, running my matches function in a loop is just as slow as using "lapply" on it.

Am I missing something? Is there a way that is faster than both?

Answer 1

Here is one option:

library(data.table)
nn <- lengths(wordList)  ## Or, for < R-3.2.0, `nn <- sapply(wordList, length)` 
dt <- data.table(grp=rep(seq_along(nn), times=nn), X = unlist(wordList), key="grp")
dt[,Score:=scores[chmatch(X,words)]]
dt[!is.na(Score), list(avgScore=mean(Score)), by="grp"]
#    grp avgScore
# 1:   1     0.70
# 2:   2     1.50
# 3:   3     4.05
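
The same flatten-then-group idea can also be sketched in base R, without data.table. This is only an illustration under the sample data from the question, not a benchmarked replacement; the names `grp`, `flat` and `s` are introduced here, and `match` stands in for `chmatch`:

```r
words <- c("cat", "dog", "snake", "cow")
scores <- c(1.5, 0.7, 3.5, 4.6)
wordList <- list(c("jiraffe", "dog"), c("cat", "elephant"), c("snake", "cow"))

nn <- lengths(wordList)                   # number of words per list element
grp <- rep(seq_along(nn), times = nn)     # group id for every flattened word
flat <- unlist(wordList)                  # one long character vector

# A single vectorised lookup; NA where a word is not in the dictionary
s <- scores[match(flat, words)]

# Mean score per group, ignoring unmatched words
avgScore <- tapply(s, grp, mean, na.rm = TRUE)
```

One difference from the data.table version: a group with no matching word at all is not dropped here but shows up as NaN.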

Fast string matching in R
