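I'm still quite new to working with lists, so I apologize if this question sounds silly.
Starting from an initial base of 459,046 customers, I created a function that splits the base and stores the pieces as multiple elements of a list.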
sampled_list <- baseSample(dataset = clv_df_cbs, sample.size = 10000, seed = 12345)
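Running this function (baseSample) returns a new list object containing mutually exclusive groups of customers (each group consists of 10,000 customers, except the last, which may be smaller depending on the initial volume).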
> sampled_list <- baseSample(dataset = clv_df_cbs, sample.size = 10000, seed = 12345)
[1] "Seed: 12345"
[1] "Total groups created: 46"
[1] "Group size: 10000"
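In this case, the output is a list of 46 elements stored in an object called sampled_list.
Now I would like to pass each of these 46 elements to a BTYD model, which will forecast the number of transactions over the next 90 days (based on what it learns from the input).
The reason I can't pass the full dataset to the BTYD model is that the model makes heavy use of MCMC, so the computation time becomes so long that the model never returns any output. I therefore decided to generate the forecasts by running the same model several times (on sufficiently large samples) until the entire base has been passed through the model.
The operations that need to be performed on each element are as follows: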
# Estimate parameters for element1 of the list
pggg.draws1 <- pggg.mcmc.DrawParameters(element1,
                                        mcmc = 1000,   # number of MCMC steps
                                        burnin = 250,  # number of initial MCMC steps which are discarded
                                        thin = 10,     # only every thin-th MCMC step will be returned
                                        chains = 2,    # number of MCMC chains to be run
                                        trace = 50)    # print logging step every trace iteration
# generate draws for holdout period
pggg.xstar.draws1 <- mcmc.DrawFutureTransactions(element1, pggg.draws1)
# conditional expectations
element1$xstar.pggg <- apply(pggg.xstar.draws1, 2, mean)
# P(active)
element1$pactive.pggg <- mcmc.PActive(pggg.xstar.draws1)
# P(alive)
element1$palive.pggg <- mcmc.PAlive(pggg.draws1)
# show estimates for first few customers
head(element1[, c("x", "t.x", "x.star",
                  "xstar.pggg", "pactive.pggg", "palive.pggg")], 50)
# report median cohort-level parameter estimates
round(apply(as.matrix(pggg.draws1$level_2), 2, median), 3)
# report mean over median individual-level parameter estimates
median.est1 <- sapply(pggg.draws1$level_1, function(draw) {
  apply(as.matrix(draw), 2, median)
})
round(apply(median.est1, 1, mean), 3)
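Ideally, the output would be stored directly in a new data.frame, so that I can retrieve the IDs and the predictions (along with whatever else was originally included in the dataset).
Below is some mock data, built from a publicly available dataset, that can be used to reproduce the setup.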
library(BTYDplus)
library(tidyverse)
data("groceryElog")
dataset<-elog2cbs(groceryElog, T.cal = "2006-12-01")
# FUNCTION baseSample ####
baseSample <- function(dataset, sample.size, seed = NULL) {
  # Default seed: today's date as a number in YYYYMMDD form
  seed.value <- if (is.null(seed)) {
    as.numeric(format(Sys.Date(), "%Y")) * 10000 +
      as.numeric(format(Sys.Date(), "%m")) * 100 +
      as.numeric(format(Sys.Date(), "%d"))
  } else {
    seed
  }
  set.seed(seed.value)
  # RE-ORDER DATA FRAME (SAME LENGTH)
  data <- with(dataset, dataset[order(sample(cust, nrow(dataset))), ])
  # BUILD A LIST OF DFs
  set.sample.size <- sample.size
  data$cycles_group <- paste0("sample_", ceiling(1:nrow(data) / set.sample.size))
  df_list <- split(data, data$cycles_group)
  print(paste0("Seed: ", seed.value))
  print(paste0("Total groups created: ", length(unique(data$cycles_group))))
  print(paste0("Group size: ", set.sample.size))
  return(df_list)
  #print(df_list)
}
# ** OUTPUT: Base split in lists ####
sampled_list <- baseSample(dataset = dataset, sample.size = 100, seed = 12345)
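Before feeding the groups to the model, it can be worth confirming that the split really is a partition: every customer lands in exactly one group, and only the last group is short. The sketch below uses a small made-up data frame (`toy`, with a `cust` ID column mirroring the one in the question) rather than the real data, and reproduces the shuffle-then-`split()` idea in a simplified form; it is an illustration of the check, not a replacement for baseSample.

```r
# Toy stand-in for the customer table (25 made-up customers)
toy <- data.frame(cust = sprintf("c%03d", 1:25), x = 0:24,
                  stringsAsFactors = FALSE)

set.seed(12345)
shuffled <- toy[sample(nrow(toy)), ]                       # random row order
group <- paste0("sample_", ceiling(seq_len(nrow(shuffled)) / 10))
toy_list <- split(shuffled, group)                         # list of data frames

# Group sizes, and all customer IDs pooled back together
sizes <- vapply(toy_list, nrow, integer(1))
all_ids <- unlist(lapply(toy_list, function(d) d$cust), use.names = FALSE)

# Every customer appears exactly once across the groups
stopifnot(!anyDuplicated(all_ids), setequal(all_ids, toy$cust))
print(sizes)
```

The same two `stopifnot()` conditions can be pointed at sampled_list and dataset$cust to check the real split.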
Thanks
Answer 0 (score: 3)
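In base R, you can use lapply to iterate a function over the elements of a list and return a new list made up of the results of those iterations. After using the example code to generate the list named sampled_list ...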
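As a toy illustration of that pattern, separate from the BTYD model, here is a minimal sketch. The list `toy_list` and the helper `summarise_group()` are made up for this example and are not part of BTYDplus; the point is only to show how lapply returns one result per list element and how do.call(rbind, ...) stacks one-row data frames.

```r
# Hypothetical list of two small data frames standing in for sampled_list
toy_list <- list(
  sample_1 = data.frame(x = c(1, 2, 3)),
  sample_2 = data.frame(x = c(10, 20))
)

# Illustrative helper: reduce one group to a one-row data frame
summarise_group <- function(df) {
  data.frame(n = nrow(df), mean_x = mean(df$x))
}

per_group <- lapply(toy_list, summarise_group)  # list of one-row data frames
bound <- do.call(rbind, per_group)              # row names come from the list names
print(bound)
```

Because each element of `per_group` has a single row, the row names of `bound` are simply the names of the list elements.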
# turn the code for the operations you want to perform on each list element into a function,
# with a couple of minor tweaks
thingy <- function(i) {
  # Estimate parameters for the current list element
  pggg.draws1 <- pggg.mcmc.DrawParameters(i,
                                          mcmc = 1000,   # number of MCMC steps
                                          burnin = 250,  # number of initial MCMC steps which are discarded
                                          thin = 10,     # only every thin-th MCMC step will be returned
                                          chains = 2,    # number of MCMC chains to be run
                                          trace = 50)    # print logging step every trace iteration
  # generate draws for holdout period
  pggg.xstar.draws1 <- mcmc.DrawFutureTransactions(i, pggg.draws1)
  # the three columns below are added to the local copy of i only; they are not returned
  # conditional expectations
  i$xstar.pggg <- apply(pggg.xstar.draws1, 2, mean)
  # P(active)
  i$pactive.pggg <- mcmc.PActive(pggg.xstar.draws1)
  # P(alive)
  i$palive.pggg <- mcmc.PAlive(pggg.draws1)
  # show estimates for first few customers [commenting out for this iterated version]
  # head(element1[, c("x", "t.x", "x.star", "xstar.pggg", "pactive.pggg", "palive.pggg")],50)
  # report median cohort-level parameter estimates
  # (inside a function this value is neither printed nor returned)
  round(apply(as.matrix(pggg.draws1$level_2), 2, median), 3)
  # report mean over median individual-level parameter estimates
  median.est1 <- sapply(pggg.draws1$level_1, function(draw) {
    apply(as.matrix(draw), 2, median)
  })
  # get the bits you want in a named vector
  z <- round(apply(median.est1, 1, mean), 3)
  # convert that named vector of results into a one-row data frame to make collapsing easier
  data.frame(as.list(z))
}
# now use lapply to iterate that function over the elements of your list
results <- lapply(sampled_list, thingy)
# now bind the results into a data frame
boundresults <- do.call(rbind, results)
The result (it took a while):
k lambda mu tau z
sample_1 4.200 0.174 0.091 102.835 0.27
sample_10 3.117 0.149 0.214 128.143 0.29
sample_11 4.093 0.154 0.115 130.802 0.30
sample_12 4.191 0.142 0.053 114.108 0.33
sample_13 2.605 0.155 0.071 160.743 0.35
sample_14 9.196 0.210 0.084 111.747 0.36
sample_15 2.005 0.145 0.091 298.872 0.40
sample_16 2.454 0.111 0.019 78731750.121 0.70
sample_2 2.808 0.138 0.059 812.278 0.40
sample_3 4.327 0.166 0.116 559.318 0.42
sample_4 9.266 0.166 0.038 146.283 0.40
sample_5 3.277 0.157 0.073 105.915 0.33
sample_6 9.584 0.184 0.086 118.299 0.31
sample_7 4.244 0.189 0.118 54.945 0.23
sample_8 4.388 0.147 0.085 325.054 0.36
sample_9 7.898 0.181 0.052 83.892 0.33
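You can also combine the last two steps into a single line, do.call(rbind, lapply(...)). If you want the row names of the results table as a column, you can run boundresults$sample <- row.names(boundresults) after creating the table. And if you'd rather not create a new object in your environment, you can put the function inside the call to lapply, i.e. lapply(sampled_list, function(i) { [your code] }).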
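The function above returns one row of parameter estimates per sample. If what is wanted is the customer-level output described in the question (IDs plus predictions), one option is to have the function return the augmented data frame itself and bind those. The sketch below shows only the shape of that approach: `toy_list` is made-up data, and `score_group()` is a hypothetical placeholder whose body would, in the real workflow, hold the pggg.mcmc.DrawParameters and mcmc.DrawFutureTransactions steps and add xstar.pggg, pactive.pggg and palive.pggg.

```r
# Made-up stand-in for sampled_list
toy_list <- list(
  sample_1 = data.frame(cust = c("a", "b"), x = c(2, 0), stringsAsFactors = FALSE),
  sample_2 = data.frame(cust = "c", x = 5, stringsAsFactors = FALSE)
)

# Hypothetical placeholder for the per-group model run
score_group <- function(df) {
  df$xstar.demo <- df$x * 0.5   # made-up "prediction", for illustration only
  df                            # return the augmented data frame, not a summary
}

scored_list <- lapply(toy_list, score_group)
scored <- do.call(rbind, scored_list)           # one row per customer

# Record which sample each customer came from, then drop the composite row names
scored$sample <- rep(names(scored_list),
                     times = vapply(scored_list, nrow, integer(1)))
rownames(scored) <- NULL
print(scored)
```

With the data produced by baseSample, the cycles_group column already carries the sample label, so the `rep()` step would be optional there.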