Best and most efficient way to count word tokens

Asked: 2010-12-10 20:58:13

Tags: r

I have a CSV that starts with 3 columns: a "cost" column, a column with the cumulative percent of "cost", and a keyword column. The R script works on a small file, but when I feed it the real thing (a million rows) it completely dies (never finishes). Can you help me make this script more efficient? Token.Count is the column I'm having trouble creating. Thanks!

# Token Histogram

# Import CSV data from Report Downloader API Feed
Mydf <- read.csv("Output_test.csv.csv", sep=",", header = TRUE, stringsAsFactors=FALSE)

# Helps limit the dataframe according to the HTT
# Change number to:
# .99 for big picture
# .8 for HEAD
limitor <- Mydf$CumuCost <= .8
# De-comment to ONLY measure TORSO
#limitor <- (Mydf$CumuCost <= .95 & Mydf$CumuCost > .8)
# De-comment to ONLY measure TAIL
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .95)
# De-comment to ONLY measure Non-HEAD
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .8)

# Creates a column with HTT segmentation labels
# Creates a dataframe
HTT <- data.frame()
# Populates dataframe according to conditions
HTT <- ifelse(Mydf$CumuCost <= .8,"HEAD",ifelse(Mydf$CumuCost <= .95,"TORSO","TAIL"))
# Add the column to Mydf and rename it HTT
Mydf <- transform(Mydf, HTT = HTT)

# Count all KWs in account by using the dimension function
KWportfolioSize <- dim(Mydf)[1]

# Percent of portfolio
PercentofPortfolio <- sum(limitor)/KWportfolioSize

# Length of Keyword -- TOO SLOW
# Uses the Tau package
# My function takes the row number and returns the number of tokens
library(tau)
Myfun = function(n) {
  sum(sapply(Mydf$Keyword.text[n], textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L))}
# Creates a dataframe to hold the results
Token.Count <- data.frame()
# Loops until last row and store it in data.frame
for (i in c(1:dim(Mydf)[1])) {Token.Count <- rbind(Token.Count,Myfun(i))}
# Add the column to Mydf
Mydf <- transform(Mydf, Token.Count = Token.Count)
# Not quite sure why but the column needs renaming in this case
colnames(Mydf)[dim(Mydf)[2]] <- "Token.Count"

1 Answer:

Answer 0 (score: 2):

Pre-allocate your storage beforehand, and never do what you are doing now: concatenating or rbind/cbind-ing objects inside a loop. R has to copy, allocate more storage, etc., on every iteration of the loop, and that overhead is what is crippling your code.

Create Token.Count with enough rows and columns up front and fill it in inside the loop. Something like:

Token.Count <- matrix(ncol = ?, nrow = nrow(Mydf))
for (i in seq_len(nrow(Mydf))) {
    Token.Count[i, ] <- Myfun(i)
}
Token.Count <- data.frame(Token.Count)

Sorry I can't be more specific, but I don't know how many columns Myfun returns.
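
Since Myfun as written returns a single number per row, a pre-allocated numeric vector is arguably even simpler; a minimal sketch, assuming Mydf and Myfun as defined in the question:

Token.Count <- numeric(nrow(Mydf))        # one slot per row, allocated once
for (i in seq_len(nrow(Mydf))) {
    Token.Count[i] <- Myfun(i)            # fill in place, no copying or growing
}
Mydf$Token.Count <- Token.Count           # attach as a new column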


Update 1: Having looked at textcnt, I think you can avoid the loop entirely. You have a data frame like this:

DF <- data.frame(CumuCost = c(0.00439, 0.0067), Cost = c(1678, 880),
                 Keyword.text = c("north+face+outlet", "kinect sensor"),
                 stringsAsFactors = FALSE)

If we pull the keywords out and convert them to a list:

keywrds <- with(DF, as.list(Keyword.text))
head(keywrds)

Then we can call textcnt recursively on this list to count the words in each list component:

countKeys <- textcnt(keywrds, split = "[[:space:][:punct:]]+", method = "string",
                     n = 1L, recursive = TRUE)
head(countKeys)

The above is pretty much what you had, except that I've added recursive = TRUE so each input vector is handled separately. The final step is to sapply the sum function over countKeys to get the word counts:

> sapply(countKeys, sum)
[1] 3 2

That appears to be what you were trying to achieve with your loop and function. Have I got that right?
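
If that is what you were after, attaching the counts back onto the data frame is then a one-liner; a sketch, assuming the full data are in Mydf as in the question:

Mydf$Token.Count <- sapply(textcnt(as.list(Mydf$Keyword.text),
                                   split = "[[:space:][:punct:]]+",
                                   method = "string", n = 1L,
                                   recursive = TRUE), sum)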


Update 2: OK, if fixing the pre-allocation problem and using textcnt in a vectorised fashion is still not as quick as you'd like, we can look into other ways of counting words. Quite possibly you don't need all the power of textcnt for what you want to do. [I can't check whether the solution below works for all of your data, but it is much quicker.]

One potential solution is to split the Keyword.text vector into words using the built-in strsplit function, for example using the keywrds generated above and just its first element:

> length(unlist(strsplit(keywrds[[1]], split = "[[:space:][:punct:]]+")))
[1] 3

To make use of this idea, it is probably easier to wrap it in a user function:

fooFun <- function(x) {
    length(unlist(strsplit(x, split = "[[:space:][:punct:]]+"),
                  use.names = FALSE, recursive = FALSE))
}

We can then apply this function to the keywrds list:

> sapply(keywrds, fooFun)
[1] 3 2

For this simple example data set we get the same result. What about computation time? First the textcnt solution, combining the two steps from Update 1:

> system.time(replicate(10000, sapply(textcnt(keywrds, 
+                                     split = "[[:space:][:punct:]]+", 
+                                     method = "string", n = 1L, 
+                                     recursive = TRUE), sum)))
   user  system elapsed 
  4.165   0.026   4.285

And then the solution from Update 2:

> system.time(replicate(10000, sapply(keywrds, fooFun)))
   user  system elapsed 
  0.883   0.001   0.889

So even for this small example there is quite a bit of overhead in calling textcnt, but whether that difference holds up once the two approaches are applied to the full data set remains to be seen.

Finally, we should note that the strsplit approach can be vectorised to work directly on the Keyword.text vector in DF:

> sapply(strsplit(DF$Keyword.text, split = "[[:space:][:punct:]]+"), length)
[1] 3 2

which gives the same result as the other two approaches and is slightly faster than the non-vectorised use of strsplit:

> system.time(replicate(10000, sapply(strsplit(DF$Keyword.text, 
+                              split = "[[:space:][:punct:]]+"), length)))
   user  system elapsed 
  0.732   0.001   0.734

Is either of these faster on your full data set?
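
One way to find out is to time both vectorised approaches on the real Keyword.text column; a rough sketch, assuming Mydf from the question is already loaded:

# textcnt-based counts on the full data
system.time(cnt1 <- sapply(textcnt(as.list(Mydf$Keyword.text),
                                   split = "[[:space:][:punct:]]+",
                                   method = "string", n = 1L,
                                   recursive = TRUE), sum))
# strsplit-based counts on the full data
system.time(cnt2 <- sapply(strsplit(Mydf$Keyword.text,
                                    split = "[[:space:][:punct:]]+"), length))
# the two should agree
all.equal(unname(cnt1), cnt2)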

Minor update: Replicating DF to give 130 rows of data (with keywrds2 rebuilt from DF2 in the same way as keywrds above) and timing the three approaches suggests that the last one (vectorised strsplit()) scales much better:

> DF2 <- rbind(DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF,DF)
> dim(DF2)
[1] 130   3
> system.time(replicate(10000, sapply(textcnt(keywrds2, split = "[[:space:][:punct:]]+", method = "string", n = 1L, recursive = TRUE), sum)))
   user  system elapsed 
238.266   1.790 241.404
> system.time(replicate(10000, sapply(keywrds2, fooFun)))
   user  system elapsed 
 28.405   0.007  28.511
> system.time(replicate(10000, sapply(strsplit(DF2$Keyword.text,split = "[[:space:][:punct:]]+"), length)))
   user  system elapsed 
  7.497   0.011   7.528
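
Putting this together for the original problem, the whole Token.Count loop could presumably be replaced by a single vectorised line; a sketch, assuming Mydf is loaded as in the question:

# one pass over the keyword column; no explicit loop, no rbind
Mydf$Token.Count <- sapply(strsplit(Mydf$Keyword.text,
                                    split = "[[:space:][:punct:]]+"), length)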