Need performance in R: "which" seems to take too long

Time: 2016-05-23 07:43:15

Tags: r performance which

Here is my data frame:

b<-data.frame(SEQUENCEID=c("0","1","1","1","2","2"),
              EVENTID=c("1","1","1","2","1","2"),
              ITEM=c("z","a","b","a","c","a"))

  SEQUENCEID EVENTID ITEM
1          0       1    z
2          1       1    a
3          1       1    b
4          1       2    a
5          2       1    c
6          2       2    a

I need to put the items associated with each unique combination [SEQUENCEID; EVENTID] into one list element, so that my final result looks like this:

$`1`          
       "0"        "1"        "z" 

$`2`                    
       "1"        "1"        "a"        "b" 

$`3`          
       "1"        "2"        "a" 

$`4`          
       "2"        "1"        "c" 

I actually know how to do this. My problem is that it takes far too long, because my data.frame has about 1 million rows. Here is the script:

#STEP 1
b$combi=as.character(paste(b$SEQUENCEID,b$EVENTID,sep="|"))
combi_unique=unique(b$combi)
stock=sapply(combi_unique,function(x) b$ITEM[which(b$combi==x)])
names(stock)=NULL

#STEP 2
r=as.list(as.data.frame(t(unique(b[,c("SEQUENCEID","EVENTID")]))))

#STEP 3
results=mapply(c, r, stock, SIMPLIFY=FALSE)

How would you rewrite this so that it runs faster?

2 answers:

Answer 0 (score: 1)

split(b$ITEM, with(b, interaction(SEQUENCEID, EVENTID)))
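One caveat with this one-liner: by default interaction() builds a level for every possible SEQUENCEID x EVENTID pair, so pairs that never occur in the data come back as empty list elements. A minimal sketch, assuming you only want the combinations that are actually observed, passes drop = TRUE (the name items is just illustrative):

```r
b <- data.frame(SEQUENCEID = c("0","1","1","1","2","2"),
                EVENTID    = c("1","1","1","2","1","2"),
                ITEM       = c("z","a","b","a","c","a"),
                stringsAsFactors = FALSE)

# drop = TRUE discards combinations (factor levels) that never occur,
# so no empty elements have to be filtered out afterwards
items <- split(b$ITEM, interaction(b$SEQUENCEID, b$EVENTID, drop = TRUE))
```

The list names take the form "SEQUENCEID.EVENTID", so items[["1.1"]] holds the items for SEQUENCEID 1, EVENTID 1.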

Or, formatted exactly as you asked:

b<-data.frame(SEQUENCEID=c("0","1","1","1","2","2"),
              EVENTID=c("1","1","1","2","1","2"),
              ITEM=c("z","a","b","a","c","a"), stringsAsFactors=FALSE)
# ensure stringsAsFactors=FALSE; in your big data frame this would ..
# ... translate as if(is.factor(b$ITEM)) b$ITEM<-as.character(b$ITEM)
bs <- split(b, with(b, interaction(SEQUENCEID, EVENTID)))

# get rid of empty elements:
bs <- bs[sapply(bs, NROW)>0]

lapply(bs, function(x) with(x,c(SEQUENCEID[1], EVENTID[1], ITEM)))
# alternatively:
lapply(bs, function(x) with(x,c(SEQUENCEID=SEQUENCEID[1], EVENTID=EVENTID[1], ITEM)))

..I would guess there is a faster data.table solution
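Following up on that guess, a data.table version might look roughly like the sketch below. It assumes the data.table package is installed; the names dt, grouped, ITEMS and results are illustrative choices, and it has not been timed against the other approaches here.

```r
library(data.table)

b <- data.frame(SEQUENCEID = c("0","1","1","1","2","2"),
                EVENTID    = c("1","1","1","2","1","2"),
                ITEM       = c("z","a","b","a","c","a"),
                stringsAsFactors = FALSE)

dt <- as.data.table(b)

# One row per (SEQUENCEID, EVENTID) pair; ITEMS is a list column
# holding all items of that pair, in their original order
grouped <- dt[, .(ITEMS = list(ITEM)), by = .(SEQUENCEID, EVENTID)]

# Prepend the two keys to each item vector to get the requested layout;
# unname() drops the names that Map() derives from its first argument
results <- unname(Map(c, grouped$SEQUENCEID, grouped$EVENTID, grouped$ITEMS))
```

The grouping is done in a single pass over the data, rather than one scan of the key column per unique combination as in the question's script.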

Answer 1 (score: 1)

You could try a combination of tidyr and dplyr:

library(dplyr)
library(tidyr)
b = tibble(   # data_frame() is deprecated in favour of tibble()
  SEQUENCEID = c("0","1","1","1","2","2"),
  EVENTID = c("1","1","1","2","1","2"),
  ITEM = c("z","a","b","a","c","a")
)    

final = b %>% 
  group_by(SEQUENCEID, EVENTID) %>% 
  nest() %>% 
  lapply(identity)

Benchmark

I simulated a larger data frame with the same structure and 10^7 rows:

library(dplyr)
library(tidyr)
b = tibble(   # data_frame() is deprecated in favour of tibble()
   SEQUENCEID = sample(1:10, 10^7, replace = TRUE),
   EVENTID = sample(1:10, 10^7, replace = TRUE),
   ITEM = sample(letters, 10^7, replace = TRUE)
)

The code runs in about 3 seconds on my Mac:

system.time({
  final = b %>% 
    group_by(SEQUENCEID, EVENTID) %>% 
    nest() %>% 
    lapply(identity)
})

For a 10^8-row dataset it takes longer: 44 seconds.
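To check timings on your own machine, a small base-R harness like the sketch below can compare the question's per-key which() scan with a single split() call. The row count n, the seed and all variable names are assumptions chosen for illustration, and no timings are claimed here; increase n to approach the sizes discussed above.

```r
set.seed(1)
n <- 1e5  # illustrative size

b <- data.frame(
  SEQUENCEID = as.character(sample(1:10, n, replace = TRUE)),
  EVENTID    = as.character(sample(1:10, n, replace = TRUE)),
  ITEM       = sample(letters, n, replace = TRUE),
  stringsAsFactors = FALSE
)

key  <- paste(b$SEQUENCEID, b$EVENTID, sep = "|")
keys <- unique(key)

# Question's approach: one full scan of the key vector per unique key
t_which <- system.time(
  by_which <- lapply(keys, function(k) b$ITEM[which(key == k)])
)

# Single-pass grouping; the levels keep the order of first appearance
t_split <- system.time(
  by_split <- split(b$ITEM, factor(key, levels = keys))
)

print(rbind(which = t_which, split = t_split))

# Both approaches should produce the same groups in the same order
same <- identical(by_which, unname(by_split))
```

The which() version does work proportional to rows times unique keys, which is why it degrades as the number of distinct [SEQUENCEID; EVENTID] pairs grows.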