I'm looking for a simple way to vectorize a for loop in R. I have the following data frame containing sentences, plus positive and negative word dictionaries:
# Create data.frame with sentences
sent <- data.frame(words = c("just right size and i love this notebook",
                             "benefits great laptop", "wouldnt bad notebook",
                             "very good quality", "orgtop",
                             "great improvement for that bad product but overall is not good",
                             "notebook is not good but i love batterytop"),
                   user = c(1,2,3,4,5,6,7), stringsAsFactors = FALSE)
# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
"extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
"wouldnt bad")
negWords <- c("hate","bad","not good","horrible")
Now I replicate the original data frame to simulate a big data set:
# Replicate original data.frame - big data simulation (700,000 rows of sentences)
df.expanded <- as.data.frame(replicate(100000, sent$words))
library(zoo)  # coredata() comes from zoo
sent <- coredata(sent)[rep(seq(nrow(sent)), 100000), ]
rownames(sent) <- NULL
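A quick sanity check that the simulated data has the expected size (7 sentences x 100,000 replications):

nrow(sent)  # 700000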
For my next step, I build a dictionary data frame with sentiment values (pos word = 1, neg word = -1) and sort it by word length in descending order, so that longer phrases are matched first.
# Ordering words in pos/negWords
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F)
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1))
wordsDF$lengths <- nchar(wordsDF$words)  # nchar() is already vectorized, no lapply needed
wordsDF <- wordsDF[order(-wordsDF[,3]),]
rownames(wordsDF) <- NULL
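The longest-first ordering matters: a phrase like "not good" (value -1) must be matched and removed before the bare word "good" (value +1) gets its turn, otherwise negative phrases would be credited as positive. A small illustration:

grepl("\\<not good\\>", "overall is not good")              # TRUE -> contributes -1
gsub("\\s*\\bnot good\\b\\s*", " ", "overall is not good")  # "overall is " - "good" can no longer match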
Then I defined the following function with a for loop:
library(qdapRegex)  # provides rm_white()

# Sentiment score function
scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<", wordsDF[x,1], '\\>', sep="")  # pattern matching the exact word/phrase
    count <- length(grep(matchWords, sentence))              # count them
    if(count){
      score <- score + (count * wordsDF[x,2])  # compute score (count * sentValue)
      sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*'), ' ', sentence)  # remove the matched word from the sentence
      sentence <- rm_white(sentence)           # trim the leftover whitespace
    }
  }
  score
}
I call this function on the sentences in the data frame:
# Apply scoreSentence function to sentences
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
# Time consumption for 700,000 sentences in the sent data.frame:
# user system elapsed
# 1054.19 0.09 1056.17
# Add sentiment score to origin sent data.frame
sent <- cbind(sent, SentimentScore2)
The desired output is:
Words user SentimentScore2
just right size and i love this notebook 1 2
benefits great laptop 2 1
wouldnt bad notebook 3 1
very good quality 4 1
orgtop 5 0
.
.
.
and so on...
Please, could anyone help me reduce the computation time of my original approach? I'm stuck because of my beginner programming skills in R :-) Any help or advice is very much appreciated. Many thanks in advance.
Answer 0 (score: 5)
In the spirit of "better to teach someone to fish than to give them a fish", I'll walk you through it:
Copy your code: you're going to mess it up!
Find the bottleneck:
1a: Reduce the problem size:
nRep <- 100
df.expanded <- as.data.frame(replicate(nRep, sent$words))
library(zoo)
sent <- coredata(sent)[rep(seq(nrow(sent)),nRep),]
1b: Keep a reference solution: you are going to change your code, and few activities are as prone to introducing bugs as optimizing code!
sentRef <- sent
and add the same line, commented out, at the end of your code so you remember where your reference came from. To make it even easier to check that you haven't broken your code, you can test it automatically at the end of the script:
library("testthat")
expect_equal(sent,sentRef)
1c: Run the profiler around the code you want to inspect:
Rprof()
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
Rprof(NULL)
1d: Get a view of the result, using base R: summaryRprof()
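For instance (a minimal sketch; the exact output varies by machine and run):

prof <- summaryRprof()
head(prof$by.self)   # self time per function, most expensive first
head(prof$by.total)  # total time, including time spent in callees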
There are nicer tools as well; you can look at the profr or lineprof packages.
lineprof is my tool of choice, and a real added value here: it lets you narrow the problem down to these two lines:
matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
count <- length(grep(matchWords,sentence)) # count them
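For reference, a minimal lineprof run could look like the sketch below. Assumptions on my part: lineprof is a GitHub-only package (hadley/lineprof), and the profiled function must be source()d from a file (here a hypothetical scoreSentence2.R) so that line-level information is available:

# devtools::install_github("hadley/lineprof")
library(lineprof)
source("scoreSentence2.R")  # hypothetical file containing scoreSentence2()
prof <- lineprof(unlist(lapply(sent$words, scoreSentence2)))
prof  # prints time and memory per line of scoreSentence2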
Fix it:
3.1 Fortunately, the main problem is fairly simple: the first line doesn't need to be inside the function, move it out before the loop. By the way, the same goes for your paste0(). Your code becomes:
matchWords <- paste("\\<", wordsDF[,1], '\\>', sep="")     # exact-word patterns, built once
matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')  # removal patterns, built once
library(qdapRegex)  # only needed if rm_white() below is re-enabled

# Sentiment score function
scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    count <- length(grep(matchWords[x], sentence))  # count them
    if(count){
      score <- score + (count * wordsDF[x,2])           # compute score (count * sentValue)
      sentence <- gsub(matchedWords[x], ' ', sentence)  # remove matched words
      # sentence <- rm_white(sentence)
    }
  }
  score
}
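To measure the gain on the reduced problem from step 1a, wrapping the call in system.time() is enough (a sketch; absolute timings will differ by machine):

system.time(SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2)))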
This brings the execution time for 1,000 reps down from 5.64s to 2.32s. Not a bad investment!
3.2 The next bottleneck is the "count <-" line, but I think Shadow has the right answer :-) Combining the two, we get:
matchWords <- paste("\\<", wordsDF[,1], '\\>', sep="")     # exact-word patterns, built once
matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')  # removal patterns, built once

# Sentiment score function
scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    count <- grepl(matchWords[x], sentence)           # TRUE/FALSE is enough here
    score <- score + (count * wordsDF[x,2])           # compute score (count * sentValue)
    sentence <- gsub(matchedWords[x], ' ', sentence)  # remove matched words
    # sentence <- rm_white(sentence)                  # optional; needs qdapRegex
  }
  score
}
This gets it down to 0.18s, i.e. 31 times faster...
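A final check against the reference from step 1b is cheap insurance (a sketch, assuming sentRef was saved together with the scores of the original, slow run):

SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
expect_equal(SentimentScore2, sentRef$SentimentScore2)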
Answer 1 (score: 1)
You can easily vectorize your scoreSentence2 function, since grepl and gsub are already vectorized over the vector of sentences:
library(qdapRegex)  # provides rm_white()

scoreSentence <- function(sentence){
  score <- rep(0, length(sentence))  # one running score per sentence
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<", wordsDF[x,1], '\\>', sep="")  # matching exact words
    count <- grepl(matchWords, sentence)                     # logical vector: which sentences match?
    score <- score + (count * wordsDF[x,2])                  # compute score (count * sentValue)
    sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*'), ' ', sentence)  # remove matched words
    sentence <- rm_white(sentence)
  }
  return(score)
}
scoreSentence(sent$words)
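Since the loop now runs once per dictionary entry rather than once per sentence, a single call handles all 700,000 sentences. To measure the speed-up yourself, a simple wrapper is enough (a sketch; timings depend on your machine):

system.time(sent$SentimentScore2 <- scoreSentence(sent$words))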
Note that count does not actually count how many times an expression appears within one sentence (neither in your version nor in mine); it only indicates whether the expression appears at all. If you want to truly count occurrences, you could use the following instead.
count <- sapply(gregexpr(matchWords, sentence), function(x) length(x[x>0]))
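For illustration, a standalone toy example of this gregexpr() idiom (non-matches return -1, which is why match positions are filtered with x > 0):

sapply(gregexpr("\\<good\\>", c("good good", "not good", "bad")),
       function(x) length(x[x > 0]))
# [1] 2 1 0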