This is my data frame:
b<-data.frame(SEQUENCEID=c("0","1","1","1","2","2"),
EVENTID=c("1","1","1","2","1","2"),
ITEM=c("z","a","b","a","c","a"))
  SEQUENCEID EVENTID ITEM
1          0       1    z
2          1       1    a
3          1       1    b
4          1       2    a
5          2       1    c
6          2       2    a
I need to put the items associated with each unique [SEQUENCEID; EVENTID] combination into one list element, so that my final result looks like this:
$`1`
"0" "1" "z"
$`2`
"1" "1" "a" "b"
$`3`
"1" "2" "a"
$`4`
"2" "1" "c"
I actually know how to do this; my problem is that it takes far too long, because my data.frame has about 1 million rows. Here is the script:
#STEP 1
b$combi=as.character(paste(b$SEQUENCEID,b$EVENTID,sep="|"))
combi_unique=unique(b$combi)
stock=sapply(combi_unique,function(x) b$ITEM[which(b$combi==x)])
names(stock)=NULL
#STEP 2
r=as.list(as.data.frame(t(unique(b[,c("SEQUENCEID","EVENTID")]))))
#STEP 3
results=mapply(c, r, stock, SIMPLIFY=FALSE)
How would you recode this to make it run faster?
Answer 0: (score: 1)
Try:
split(b$ITEM, with(b, interaction(SEQUENCEID, EVENTID)))
b<-data.frame(SEQUENCEID=c("0","1","1","1","2","2"),
EVENTID=c("1","1","1","2","1","2"),
ITEM=c("z","a","b","a","c","a"), stringsAsFactors=FALSE)
# ensure stringsAsFactors=FALSE; in your big data frame this would ..
# ... translate as if(is.factor(b$ITEM)) b$ITEM<-as.character(b$ITEM)
bs <- split(b, with(b, interaction(SEQUENCEID, EVENTID)))
# get rid of empty elements:
bs <- bs[sapply(bs, NROW) > 0]
lapply(bs, function(x) with(x,c(SEQUENCEID[1], EVENTID[1], ITEM)))
# alternatively:
lapply(bs, function(x) with(x,c(SEQUENCEID=SEQUENCEID[1], EVENTID=EVENTID[1], ITEM)))
...I suspect there would be a faster data.table solution.
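To make the data.table suggestion concrete, here is a minimal sketch, assuming the data.table package is installed. The names `dt`, `res`, `results`, and the `items` column are illustrative and not from the original post. It groups by both key columns and stores the keys followed by the items as one list element per group; it has not been benchmarked against the code above.

```r
library(data.table)

b <- data.frame(SEQUENCEID = c("0", "1", "1", "1", "2", "2"),
                EVENTID    = c("1", "1", "1", "2", "1", "2"),
                ITEM       = c("z", "a", "b", "a", "c", "a"),
                stringsAsFactors = FALSE)

dt <- as.data.table(b)

# Inside j, the grouping columns are length-1 values for the current group,
# so c(SEQUENCEID, EVENTID, ITEM) yields the two keys followed by all items.
res <- dt[, .(items = list(c(SEQUENCEID, EVENTID, ITEM))),
          by = .(SEQUENCEID, EVENTID)]

# Plain unnamed list, one character vector per (SEQUENCEID, EVENTID) pair,
# in order of first appearance.
results <- res$items
```

Note that this returns one element for every combination present in the data, including ("2", "2"), as the script in the question does.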
Answer 1: (score: 1)
You can try a combination of tidyr and dplyr:
library(dplyr)
library(tidyr)
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
b = tibble(
SEQUENCEID = c("0","1","1","1","2","2"),
EVENTID = c("1","1","1","2","1","2"),
ITEM = c("z","a","b","a","c","a")
)
final = b %>%
group_by(SEQUENCEID, EVENTID) %>%
nest() %>%
lapply(identity)
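As far as I can tell, the pipeline above returns a list of the nested tibble's columns (the two key columns plus a list-column of data) rather than one vector per group. If the exact shape from the question is needed, a minimal sketch, assuming dplyr 1.0 or later for the `.groups` argument, could look like the following. The names `grouped` and `results` are illustrative.

```r
library(dplyr)

b <- tibble(
  SEQUENCEID = c("0", "1", "1", "1", "2", "2"),
  EVENTID    = c("1", "1", "1", "2", "1", "2"),
  ITEM       = c("z", "a", "b", "a", "c", "a")
)

# One row per (SEQUENCEID, EVENTID) pair, with the items in a list-column.
grouped <- b %>%
  group_by(SEQUENCEID, EVENTID) %>%
  summarise(items = list(ITEM), .groups = "drop")

# Prepend the two keys to each group's items; unname() drops the names
# that Map() derives from its first character argument.
results <- unname(Map(c, grouped$SEQUENCEID, grouped$EVENTID, grouped$items))
```

Groups come back sorted by key here, not in order of first appearance.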
Benchmark
I simulated a larger data frame with the same structure and 10^7 rows:
library(dplyr)
library(tidyr)
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
b = tibble(
SEQUENCEID = sample(1:10, 10^7, replace = TRUE),
EVENTID = sample(1:10, 10^7, replace = TRUE),
ITEM = sample(letters, 10^7, replace = TRUE)
)
The code runs in about 3 seconds on my Mac:
system.time({
final = b %>%
group_by(SEQUENCEID, EVENTID) %>%
nest() %>%
lapply(identity)
})
For the 10^8-row dataset it takes longer: 44 seconds.