I have a CSV that starts with 3 columns: a "Cost" column, a cumulative-percent-of-cost column, and a keyword column. The R script works on small files, but when I feed it the real file (which has a million rows) it completely dies (never finishes). Can you help me make this script more efficient? Token.Count is the part I had trouble creating. Thanks!
# Token Histogram
# Import CSV data from Report Downloader API Feed
Mydf <- read.csv("Output_test.csv.csv", sep=",", header = TRUE, stringsAsFactors=FALSE)
# Helps limit the dataframe according to the HTT
# Change number to:
# .99 for big picture
# .8 for HEAD
limitor <- Mydf$CumuCost <= .8
# Uncomment to measure ONLY the TORSO
#limitor <- (Mydf$CumuCost <= .95 & Mydf$CumuCost > .8)
# Uncomment to measure ONLY the TAIL
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .95)
# Uncomment to measure ONLY the non-HEAD
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .8)
# Creates a column with HTT segmentation labels
# Creates a dataframe
HTT <- data.frame()
# Populates dataframe according to conditions
HTT <- ifelse(Mydf$CumuCost <= .8,"HEAD",ifelse(Mydf$CumuCost <= .95,"TORSO","TAIL"))
# Add the column to Mydf and rename it HTT
Mydf <- transform(Mydf, HTT = HTT)
# Count all KWs in account by using the dimension function
KWportfolioSize <- dim(Mydf)[1]
# Percent of portfolio
PercentofPortfolio <- sum(limitor)/KWportfolioSize
# Length of Keyword -- TOO SLOW
# Uses the Tau package
# My function takes the row number and returns the number of tokens
library(tau)
Myfun = function(n) {
sum(sapply(Mydf$Keyword.text[n], textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L))}
# Creates a dataframe to hold the results
Token.Count <- data.frame()
# Loops until last row and store it in data.frame
for (i in c(1:dim(Mydf)[1])) {Token.Count <- rbind(Token.Count,Myfun(i))}
# Add the column to Mydf
Mydf <- transform(Mydf, Token.Count = Token.Count)
# Not quite sure why but the column needs renaming in this case
colnames(Mydf)[dim(Mydf)[2]] <- "Token.Count"
Answer 0 (score: 2):
Allocate your storage up front. Never do what you are doing and concatenate or rbind/cbind objects inside a loop. R has to copy the object and allocate more storage on each iteration of the loop, and that overhead is what is crippling your code.
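To illustrate the point (a hypothetical micro-benchmark, not from the original post; exact timings will vary by machine), growing an object with rbind inside a loop copies everything on every pass, whereas filling preallocated storage writes in place:

```r
# Growing a data frame with rbind: R copies the whole object every iteration
grow <- function(n) {
    out <- data.frame()
    for (i in seq_len(n)) out <- rbind(out, data.frame(x = i))
    out
}

# Filling preallocated storage: each iteration writes into existing memory
prealloc <- function(n) {
    out <- numeric(n)
    for (i in seq_len(n)) out[i] <- i
    data.frame(x = out)
}

system.time(grow(5000))     # slows dramatically as n increases
system.time(prealloc(5000)) # stays fast
```

The first version does roughly quadratic work in the number of rows, the second linear, which is why the original loop never finishes on a million rows.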
Create Token.Count with enough rows and columns, and fill it in within the loop. Something like:
Token.Count <- matrix(ncol = ?, nrow = nrow(Mydf))
for (i in seq_len(nrow(Mydf))) {
Token.Count[i, ] <- Myfun(i)
}
Token.Count <- data.frame(Token.Count)
Sorry I can't be more specific, but I don't know how many columns Myfun returns.
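In this particular case Myfun returns a single number per row, so a preallocated numeric vector suffices; a sketch, assuming Mydf and Myfun are defined as in the question:

```r
# One slot per row, filled in place -- no copying on each iteration
Token.Count <- numeric(nrow(Mydf))
for (i in seq_len(nrow(Mydf))) {
    Token.Count[i] <- Myfun(i)
}
Mydf$Token.Count <- Token.Count  # attaches with the right name; no renaming step needed
```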
Update 1: Having looked at textcnt, I think you can avoid the loop entirely. Say you have a data frame like this:
DF <- data.frame(CumuCost = c(0.00439, 0.0067), Cost = c(1678, 880),
Keyword.text = c("north+face+outlet", "kinect sensor"),
stringsAsFactors = FALSE)
If we take the keywords and convert them to a list:
keywrds <- with(DF, as.list(Keyword.text))
head(keywrds)
then we can call textcnt on this list recursively, to count the words within each component of the list:
countKeys <- textcnt(keywrds, split = "[[:space:][:punct:]]+", method = "string",
n = 1L, recursive = TRUE)
head(countKeys)
The above is almost what you had, except I added recursive = TRUE so that each input vector is handled separately. The last step is to sapply the sum function over countKeys to get the word counts:
> sapply(countKeys, sum)
[1] 3 2
which appears to be what you were trying to achieve with the loop and your function. Have I got that right?
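Dropped back into the original script, the whole Token.Count section then collapses to a few loop-free lines (a sketch, assuming Mydf is loaded as in the question):

```r
library(tau)
# Count tokens in every keyword in one vectorised call
countKeys <- textcnt(as.list(Mydf$Keyword.text),
                     split = "[[:space:][:punct:]]+",
                     method = "string", n = 1L, recursive = TRUE)
# One word count per row, attached directly to the data frame
Mydf$Token.Count <- sapply(countKeys, sum)
```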
Update 2: OK, if fixing the pre-allocation problem and using textcnt in a vectorised fashion are still not as fast as you'd like, we can investigate other ways of counting words. It is quite likely you don't need all the functionality of textcnt to do what you want. [I can't check whether the solution below works for all your data, but it is much faster.]
One possible solution is to split the Keyword.text vector into words using the built-in strsplit function, for example using the keywrds generated above and only its first element:
> length(unlist(strsplit(keywrds[[1]], split = "[[:space:][:punct:]]+")))
[1] 3
To make use of this idea, it is perhaps easier to wrap it in a user function:
fooFun <- function(x) {
length(unlist(strsplit(x, split = "[[:space:][:punct:]]+"),
use.names = FALSE, recursive = FALSE))
}
Then we can apply this function to the keywrds list:
> sapply(keywrds, fooFun)
[1] 3 2
For this simple example data set we get the same result. What about computation time? First the textcnt solution, combining the two steps from Update 1:
> system.time(replicate(10000, sapply(textcnt(keywrds,
+ split = "[[:space:][:punct:]]+",
+ method = "string", n = 1L,
+ recursive = TRUE), sum)))
user system elapsed
4.165 0.026 4.285
Then the solution from Update 2:
> system.time(replicate(10000, sapply(keywrds, fooFun)))
user system elapsed
0.883 0.001 0.889
So even for this small sample there is considerable overhead in calling textcnt; whether that difference holds up when the two approaches are applied to the full data set remains to be seen.
Finally, note that the strsplit approach can be vectorised, working directly on the Keyword.text vector in DF:
> sapply(strsplit(DF$Keyword.text, split = "[[:space:][:punct:]]+"), length)
[1] 3 2
which gives the same result as the other two approaches, and is slightly faster than the non-vectorised use of strsplit:
> system.time(replicate(10000, sapply(strsplit(DF$Keyword.text,
+ split = "[[:space:][:punct:]]+"), length)))
user system elapsed
0.732 0.001 0.734
Is either of these any faster on your full data set?
Minor update: Replicating DF to give 130 rows of data and timing the three approaches suggests that the last one (vectorised strsplit()) scales best:
> DF2 <- rbind(DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF)
> dim(DF2)
[1] 130 3
> keywrds2 <- with(DF2, as.list(Keyword.text))
> system.time(replicate(10000, sapply(textcnt(keywrds2, split = "[[:space:][:punct:]]+", method = "string", n = 1L, recursive = TRUE), sum)))
user system elapsed
238.266 1.790 241.404
> system.time(replicate(10000, sapply(keywrds2, fooFun)))
user system elapsed
28.405 0.007 28.511
> system.time(replicate(10000, sapply(strsplit(DF2$Keyword.text,split = "[[:space:][:punct:]]+"), length)))
user system elapsed
7.497 0.011 7.528
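Given those timings, the vectorised strsplit() route looks like the best bet for the full million-row file; applied to the original script it is a one-liner (a sketch; on R >= 3.2.0 lengths() can stand in for sapply(..., length) for a further small gain):

```r
# Split each keyword on whitespace/punctuation and count the pieces,
# directly over the full column -- no loop, no intermediate list step
Mydf$Token.Count <- sapply(strsplit(Mydf$Keyword.text,
                                    split = "[[:space:][:punct:]]+"),
                           length)
```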