Improving the run time of a loop

Date: 2015-10-20 05:29:14

Tags: r loops sqldf

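I am trying to improve the computational efficiency of the following procedure. I have created a toy example with data for review. The first method runs in half the time of the second method.

How can I improve the run time of the first method?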

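I am trying to find, for each id, the position (within that id's rows) of the first qn equal to 1.

Expected output for the toy data below: 3 for id 1, 2 for id 2, and 3 for id 5.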

library(sqldf)
# Toy data: id is the group key, qn is a 0/1 flag
id = c(1,1,1,1,2,2,2,5,5,5,5,5,5)
qn = c(0,0,1,1,0,1,0,0,0,1,0,1,0)
d = data.frame(cbind(id,qn))
names(d) = c("id", "qn")

# Method 1: subset d in R for each id and take the position of the first 1
un = unique(d$id)
holder = matrix(0,length(un), 1)
counter = 0

x = proc.time()

for (i in un)
{
  z = head(which(d[d$id == i,]$qn==1),1)
  counter = counter + 1
  holder[counter,] = z
}

proc.time() - x
f = sqldf("select id, count(qn) from d group by id", drv = 'SQLite')
f = cbind(f,holder)
#################################
# Method 2: run one sqldf query per id, then locate the first 1 in the result
un = unique(d$id)
holder = matrix(0,length(un), 1)
counter = 0

x = proc.time()

for (i in 1:length(un))
{
  y = paste("select * from d where id = ", un[i])
  y = sqldf(y, drv = 'SQLite')
  y = min(which(y$qn==1))
  counter = counter + 1
  holder[counter,] = y
}

proc.time() - x
f = sqldf("select id, count(qn) from d group by id", drv = 'SQLite')
f = cbind(f,holder)

3 Answers:

Answer 0 (score: 4)

You can do this without sqldf, using dplyr:

library(dplyr)
d %>% 
    group_by(id) %>% 
    summarize(first=first(which(qn==1)))

Answer 1 (score: 4)

We can also use data.table:

library(data.table)
setDT(d)[, list(first= which.max(qn)) , id]
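
One caveat worth keeping in mind: which.max(qn) returns the position of the first maximum, so for a group that contains no 1 at all it would return 1 rather than a missing value. The base R sketch below illustrates the difference on two made-up vectors (has_one and no_one are hypothetical names, not part of the question's data). It is only an illustration of the behaviour, not a replacement for the answer above.

```r
# Hypothetical qn values for two groups: one containing a 1, one without.
has_one <- c(0, 0, 1, 1)
no_one  <- c(0, 0, 0)

# which.max() gives the index of the first maximum of the vector.
pos_max_has <- which.max(has_one)   # the maximum is 1, first seen at position 3
pos_max_no  <- which.max(no_one)    # the maximum is 0, first seen at position 1

# which(...)[1L] gives NA when the group has no 1.
pos_which_has <- which(has_one == 1)[1L]
pos_which_no  <- which(no_one == 1)[1L]

print(c(pos_max_has, pos_max_no, pos_which_has, pos_which_no))
```

If groups without a 1 can occur, which(qn == 1)[1L] could be used in place of which.max(qn) inside the same data.table call. For the toy data every id has at least one 1, so both give the same result.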

Answer 2 (score: 3)

1) Use sqldf inside lapply:

do.call(rbind,
        lapply(split(d, d$id), function(i)
          sqldf("SELECT id, min(rowid) AS first
                 FROM (SELECT rowid, *
                       FROM i) AS x
                 WHERE qn = 1"))
        )

##   id first
## 1  1     3
## 2  2     2
## 5  5     3

2) Or, for a pure SQL solution, take the first rowid with qn = 1 in each group, subtract the rowid of the first row of that group, and add 1:

sqldf("select id, min_row1 - min_row + 1 first 
       from (select id, min(rowid) min_row 
             from d 
             group by id)
       join (select id, min(rowid) min_row1 
             from d where qn = 1 
             group by id) using (id)")


##   id first
## 1  1     3
## 2  2     2
## 3  5     3
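
The rowid arithmetic in 2) can be checked by hand in base R. This is a minimal sketch, assuming the row number in d stands in for SQLite's rowid; the variable names row and hit are introduced here for illustration only.

```r
# Toy data from the question.
id <- c(1,1,1,1,2,2,2,5,5,5,5,5,5)
qn <- c(0,0,1,1,0,1,0,0,0,1,0,1,0)

# Row numbers play the role of SQLite's rowid.
row <- seq_along(id)

# First row number of each id group (min_row in the SQL).
min_row <- tapply(row, id, min)

# First row number with qn == 1 in each id group (min_row1 in the SQL).
hit <- qn == 1
min_row1 <- tapply(row[hit], id[hit], min)

# Position of the first 1 within its group; matching by name keeps the ids aligned.
first <- min_row1[names(min_row)] - min_row + 1
print(first)  # 3, 2, 3 for ids 1, 2, 5
```

For id 5, for example, the group starts at row 8 and its first 1 is at row 10, so 10 - 8 + 1 = 3.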

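3) Or, for an alternative pure SQL solution, create a sequence number seq within each id in the inner select, then pick the first seq with qn = 1 within each id group: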

sqldf("select id, min(seq) first 
       from (select x.id, x.qn, count() seq 
             from d x 
             join d y on x.rowid >= y.rowid and x.id = y.id 
             group by x.rowid)
       where qn = 1
       group by id")

##   id first
## 1  1     3
## 2  2     2
## 3  5     3
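
The within-group sequence that 3) builds with a self-join can also be sketched in base R. This is an illustration under the assumption that rows are already in the desired order within each id; seq_in_id and hit are hypothetical names that do not appear in the original answer.

```r
# Toy data from the question.
id <- c(1,1,1,1,2,2,2,5,5,5,5,5,5)
qn <- c(0,0,1,1,0,1,0,0,0,1,0,1,0)

# Running position of each row within its id group (the role of seq in the SQL).
seq_in_id <- ave(seq_along(id), id, FUN = seq_along)

# Smallest within-group position among the rows where qn == 1.
hit <- qn == 1
first <- tapply(seq_in_id[hit], id[hit], min)
print(first)  # 3, 2, 3 for ids 1, 2, 5
```

Here ave() numbers the rows of each group in one pass, whereas the SQL version derives the same number by counting, for every row, the rows of the same id with a rowid no larger than its own.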