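This is a follow-up to an earlier question.
I am creating a data.frame conditional on the column names and specific row entries of an existing data.frame. Below is how I solved the problem with a for loop (thanks to @Roland for the suggestion; the actual data violate the requirements of @eddi's answer), but it has now been running on the actual data set (200 x 500,000+, rows x cols) for more than two hours...
(The data.frames generated below closely resemble the actual data.)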
set.seed(1)
a <- data.frame(year=c(1986:1990),
                events=round(runif(5,0,5),digits=2))
b <- data.frame(year=c(rep(1986:1990,each=2,length.out=40),1986:1990),
                region=c(rep(c("x","y"),10),rep(c("y","z"),10),rep("y",5)),
                state=c(rep(c("NY","PA","NC","FL"),each=10),rep("AL",5)),
                events=round(runif(45,0,5),digits=2))
d <- matrix(rbinom(200,1,0.5),10,20, dimnames=list(c(1:10), rep(1986:1990,each=4)))
e <- data.frame(id=sprintf("%02d",1:10), as.data.frame(d),
                region=c("x","y","x","z","z","y","y","z","y","y"),
                state=c("PA","AL","NY","NC","NC","NC","FL","FL","AL","AL"))
for (i in seq_len(nrow(d))) {
  for (j in seq_len(ncol(d))) {
    d[i,j] <- ifelse(d[i,j]==0,
                     a$events[a$year==colnames(d)[j]],
                     b$events[b$year==colnames(d)[j] &
                              b$state==e$state[i] &
                              b$region==e$region[i]])
  }
}
Is there a better/faster way to do this?
Answer 0 (score: 0)
# This will require a couple of merges,
# but first let's convert the data to long form and extract year as integer
# I convert result to data.table, since that's easier and faster to deal with
# Note: it *is* possible to do the melt/dcast entirely in data.table framework,
# but it's a hassle right now - there is a FR iirc about that
library(reshape2)
library(data.table)
dt = data.table(melt(e))[, year := as.integer(sub('X([0-9]*).*','\\1',variable))]
# set key for merging and merge with b and a
setkey(dt, year, region, state)
dt.result = data.table(a, key = 'year')[
              data.table(b, key = c('year', 'region', 'state'))[dt]]
# now we can compute the value we want
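# (events.1 is b's events column, events is a's; newer data.table versions
#  appear to name the clashing column i.events instead of events.1, so adjust if needed)
dt.result[, final.value := value * events.1 + (!value) * events]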
# dcast back
e.result = dcast(dt.result, id + region + state ~ variable,
                 value.var = 'final.value')
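For readers without data.table, the same long-form idea can be sketched in base R with merge(). This is not the answer's code: the toy inputs below are made up, every year appears in only one column of d (the question's d repeats each year four times, which would need an extra column index), and the names long and wide are hypothetical.

```r
# Hypothetical toy inputs, much smaller than the question's data
a <- data.frame(year = 1986:1987, events = c(1.5, 2.5))
b <- data.frame(year   = rep(1986:1987, times = 2),
                region = c("x", "x", "y", "y"),
                state  = c("NY", "NY", "PA", "PA"),
                events = c(10, 20, 30, 40))
d <- matrix(c(0, 1, 1, 0), 2, 2,
            dimnames = list(c("01", "02"), c("1986", "1987")))
e <- data.frame(id = c("01", "02"), region = c("x", "y"), state = c("NY", "PA"))

# Long form: one row per (id, year) cell of d, in column-major order
long <- data.frame(id    = rownames(d)[row(d)],
                   year  = as.integer(colnames(d)[col(d)]),
                   value = as.vector(d))

# Attach region/state, then the two candidate event values
long <- merge(long, e, by = "id")
long <- merge(long, a, by = "year")
long <- merge(long, b, by = c("year", "region", "state"),
              suffixes = c(".a", ".b"))   # yields events.a and events.b

# Same rule as the loop: 0 takes a's value, anything else takes b's
long$final <- ifelse(long$value == 0, long$events.a, long$events.b)

# Back to wide form: one row per id, one column per year
wide <- tapply(long$final, list(long$id, long$year), sum)
```

Note that merge() silently drops cells whose year/region/state combination is missing from b, so on real data it may be worth comparing nrow(long) before and after the merges.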
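Answer 1 (score: 0)
A simpler approach (I think, since it involves no melting, casting, or merging) is as follows:
First, your a and b arrays should be indexed by year (a) and by year/state/region (b):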
at = a$events; names(at) = a$year
bt = tapply(b$events,list(b$year,b$state,b$region),function(x) min(x))
# note, I used min(x) in tapply just to be on the safe side, so that the function always returns a scalar
# we now create the result of the more complex case (lookup in b)
ids = cbind(colnames(d)[col(d)],
            as.character(e$state[row(d)]),
            as.character(e$region[row(d)])
           )
vals=bt[ids]; dim(vals)=dim(d)
# and compute your desired result with the ifelse
result = ifelse(d==0,at[colnames(d)[col(d)]],vals)
# and that's it!
This should be faster (it avoids the nested loops), but I have not profiled it. Please let us know how it works for you on the full data.
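One way to gain confidence in the vectorized lookup before running it on the full data is to compare it with the nested loop on a tiny input. The sketch below does that, assuming made-up toy data; the names vectorized and looped are hypothetical and not from the answer.

```r
# Hypothetical toy inputs, much smaller than the question's data
a <- data.frame(year = 1986:1987, events = c(1.5, 2.5))
b <- data.frame(year   = rep(1986:1987, times = 2),
                region = c("x", "x", "y", "y"),
                state  = c("NY", "NY", "PA", "PA"),
                events = c(10, 20, 30, 40))
d <- matrix(c(0, 1, 1, 0), 2, 2,
            dimnames = list(c("01", "02"), c("1986", "1987")))
e <- data.frame(id = c("01", "02"), region = c("x", "y"), state = c("NY", "PA"))

# Lookup tables: a named vector by year, and a year x state x region array
at <- setNames(a$events, a$year)
bt <- tapply(b$events, list(b$year, b$state, b$region), min)

# One row of (year, state, region) names per cell of d, in column-major order
ids <- cbind(colnames(d)[col(d)],
             as.character(e$state[row(d)]),
             as.character(e$region[row(d)]))

# A character matrix with one column per dimension indexes bt by dimnames
vectorized <- ifelse(d == 0, at[colnames(d)[col(d)]], bt[ids])

# Reference result from the question's nested loop
looped <- d
for (i in seq_len(nrow(d))) {
  for (j in seq_len(ncol(d))) {
    yr <- colnames(d)[j]
    looped[i, j] <- if (d[i, j] == 0) a$events[a$year == yr] else
      b$events[b$year == yr & b$state == e$state[i] & b$region == e$region[i]]
  }
}

# Stops with an error if the two approaches disagree
stopifnot(isTRUE(all.equal(vectorized, looped)))
```

Combinations that never occur in b come back as NA from bt, so an unexpected NA in the result points to a state/region pair in e that has no matching row in b.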