对于有两个农场的公司,我有一个简单的模型,在每个农场种植两种作物(苹果和梨)。第一步是将树木的数量乘以每棵树上的果实数量。
模拟每棵树上的果实数量(跨农场和农作物)。
在R中对此进行建模时,我至少面临三个决定:
即使我添加了另一个裁剪和/或服务器场,我希望它能够正常工作 - 理想情况下即使我添加了另一个维度,例如作物品种(格兰尼史密斯等)。我还想通过名称而不是索引编号来引用农场和农作物。
这是我提出的方法。它很有用,但很难添加另一个维度,而且代码很多。有更简洁的方式吗?
构造变量:
farms <- c('Farm 1', 'Farm 2');
crops <- c('Pear', 'Apple');
params <- c('mean','sd');
numTrees <- array(0, dim=c(length(farms), length(crops)), dimnames=list(farms,crops));
fruitPerTree <- array(0, dim=c(length(farms), length(varieties), length(params)),
dimnames=list(farms,varieties,params));
# input data e.g.
numTrees['Farm 1', 'Pear'] = 100
# and
fruitPerTree['Farm 1', 'Pear', 'mean'] = 50
模拟:
simNormal2D <- function(dataVar, numSims) {
#
# Generate possible outcomes for dataVar (e.g. fruitPerTree).
# It generates them for each value of the first two dimensions.
#
# Args:
# dataVar: 3-dimensional array,
# with 3rd dim being the normal params
# numSims: integer, e.g. 10000
#
# e.g. sims <- simNormal2D(fruitPerTree, 10000)
#
# Returns:
# a 3D array with 3rd dimension indexing the simulated results
#
dims <- dimnames(dataVar);
sims <- array(dim=c(length(dims[[1]]), length(dims[[2]]), 0),
dimnames=list(dims[[1]], dims[[2]], NULL));
for(x in dims[[1]]) {
for(y in dims[[2]]) {
sim <- rnorm(numSims, dataVar[x, y, 'mean'],
dataVar[x, y, 'sd'] );
sims <- append(sims, sim);
}
}
# R fills array from first arg columnwise, so dims are reversed
sims <- array(sims, c(numSims, length(dims[[2]]),
length(dims[[1]])),
dimnames=list(c(1:numSims), dims[[2]], dims[[1]]));
# reverse them back again
sims <- aperm(sims, c(3,2,1));
return(sims);
}
simFruitPerTree <- simNormal2D(fruitPerTree, numSims);
要乘以simFruitPerTree
和numTrees
,我发现我首先要进行手动广播:
simNumTrees <- array(numTrees, dim=c(length(dims[[1]]), length(dims[[2]]), numSims),
dimnames=list(dims[[1]], dims[[2]], c(1:numSims)));
simTotalFruit <- simFruitPerTree * simNumTrees;
为了比较,在Python中(我比R知道的更好),我可以通过使用元组索引字典以及两个字典理解来执行所有这些步骤,例如:
fruit_per_tree = {}
fruit_per_tree[('Farm 1', 'Pear')] = (50, 15) # normal params
sim_fruit_per_tree = {key: random.normal(*params, size=num_sims)
for key, params in fruit_per_tree.items() }
sim_total_fruit = {key: sim_fruit_per_tree[key]*num_trees[key] for key in num_trees }
R也有简单的方法吗?谢谢!
答案 0 :(得分:1)
以下是我将如何设置这样的模拟:
#for reproducibility
set.seed(42)
#data
farms <- data.frame(farm=rep(1:2, each=2),
trees=sample(100, 4),
crop=rep(c("pear", "apple")),
mean=c(100, 200, 70, 120),
sd=c(30, 15, 25, 20))
#n
n <- 100
#simulation
fruits <- t(matrix(rnorm(n*nrow(farms), farms$mean, farms$sd), ncol=n))
#check
colMeans(fruits)
#[1] 101.10215 200.06649 68.01185 120.05096
library(reshape2)
fruits <- melt(fruits, value.name="harvest_per_tree")
farms$i <- seq_len(nrow(farms))
farm_sim <- merge(farms, fruits, by.x="i", by.y="Var2", all=TRUE)
names(farm_sim)[7] <- "sim_i"
#multiply with number of trees
farm_sim$harvest_total <- farm_sim$harvest_per_tree * farm_sim$trees
head(farm_sim)
# i farm trees crop mean sd sim_i harvest_per_tree harvest_total
# 1 1 1 92 pear 100 30 1 110.89385 10202.234
# 2 1 1 92 pear 100 30 2 145.34566 13371.801
# 3 1 1 92 pear 100 30 3 139.14609 12801.440
# 4 1 1 92 pear 100 30 4 96.00036 8832.033
# 5 1 1 92 pear 100 30 5 26.78599 2464.311
# 6 1 1 92 pear 100 30 6 94.84248 8725.508
library(ggplot2)
ggplot(farm_sim, aes(x=sim_i, y=harvest_total, colour=factor(farm))) +
geom_line() +
facet_wrap(~crop)
答案 1 :(得分:0)
如果我理解正确,那么您正在对来自n个农场的总水果输出进行建模,每个农场都有k种类型的作物(此处,n = k = 2)。每个农场每个品种都有一定数量的树木,对于每个农场,生产力(果实/树)是分布在N(μ,σ)上的随机变量,其中μ和σ取决于农场和品种。
因此,对于输入,我们构建了一个数据框,params
有5列:farm, crop, trees, mean, and sd
。然后,对于给定的农场/作物组合,每行包含树的数量,每棵树的平均生产力以及每棵树的生产力的变化。这些是输入。
如果我们在树级进行建模,那么来自给定农场的给定品种的每棵树的水果输出是:
rnorm(trees,mean,sd)
也就是说,输出是length =#trees的随机样本,其中mean和sd适合给定的品种和农场。然后,该品种/农场的所有树木的总产量只是上面的向量的总和,并且总产出是所有农场/作物的这些总和的总和。
所有这些都为我们提供了蒙特卡洛模型的 1次迭代。要确定总产出的分布,我们必须重复这个过程若干次。幸运的是,在R中这是相当简单的:
set.seed(1)
farms <- c('Farm 1', 'Farm 2')
crops <- c('Pear', 'Apple')
params <- expand.grid(farms=farms,crops=crops)
params$trees<- 100
params$mean <- 50
params$sd <- 10
n.iterations<- 1000
output <- function(i,p) {
pp <- p[3:5] # trees, mean, sd for each farm/crop
# fruit = total output for each farm/crop combination
fruit <- colSums(apply(pp,1,function(x)rnorm(x[1],x[2],x[3])))
return(sum(fruit)) # grand total output
}
dist <- sapply(1:n.iterations,output,params)
print(c(mean=mean(dist),sd=sd(dist)),quotes=F,digits=4)
# mean sd
# 19997.5 198.8
hist(dist, main="Distribution of Total Output",
sub=paste(n.iterations,"Iterations"),xlab="Total Fruit Output")
此代码与农场或品种的数量无关;只需在开头更改farms
和crops
向量。如果并非所有农场都拥有所有品种,请为缺失的品种设置params$trees <- 0
。
我们可以检查n.iterations
的影响如下。此代码只运行100,1000和10,000次完整模拟,并使用ggplot
绘制分布。
gg <- do.call(rbind,
lapply(c(100,1000,10000),
function(n)cbind(n=n,total=sapply(1:n,output,params))))
gg <- data.frame(gg)
library(ggplot2)
ggplot(gg)+
geom_histogram(aes(x=total, y=..density.., fill=factor(n)))+
scale_fill_discrete("Iterations")+
facet_wrap(~n)
最后,我建议你考虑每棵树的输出更可能是泊松分布而不是正常。如果使用rpois(...)
而不是rnorm(...)
重新运行模拟,则整体sd略低(~150而不是〜200)。
答案 2 :(得分:0)
这是我的问题的一般解决方案。我从罗兰的方法开始,并对其进行了更新,以便可以轻松更改分布,参数和尺寸。
distSim <- function(nSims, simName, distFn, dimList, paramList, constList) {
#
# Simulate from a distribution across all the dimensions.
#
# Args:
# nSims: integer, e.g. 10000
# simName: name of the output column, e.g. 'harvestPerTree'
# distFn: distribution function, e.g. rnorm
# dimList: list of dimensions,
# e.g. list(farms=c('farm A', 'farm B'), crops=c('apple', 'pear', 'durian'))
# paramList: list of parameters, each of length = product(length(d) for d in dimList),
# to be passed to the distribution function,
# e.g. list(mean=c(10,20,30,5,10,15), sd=c(2,4,6,1,2,3))
# constList: optional vector of length = product(length(d) for d in dimList)
# these are included in the output dataframe
# e.g. list(nTrees=c(10,20,30,1,2,3))
#
# Returns:
# a dataframe with one row per iteration x product(d in dimList)
#
# expand out the dimensions into a dataframe grid - one row per combination
df <- do.call(expand.grid, dimList);
nRows <- nrow(df);
# add the parameters, and constants, if present
df <- cbind(df, paramList);
if (!missing(constList)) {
df <- cbind(df, constList);
}
# repeat this dataframe for each iteration of the simulation
df <- do.call("rbind",replicate(nSims, df, simplify=FALSE));
# add a new column giving the iteration number ('simId')
simId <- sort(rep(seq(1:nSims),nRows));
df <- cbind(simId, df);
# simulate from the distribution
df[simName] <- do.call(distFn, c(list(n=nrow(df)), df[names(paramList)]))
return(df);
}
样本用法(仅为了简单起见使用正态分布):
dimList <- list(farms=c('farm A', 'farm B'), crops=c('apple', 'pear', 'durian'));
constList <- list(numTrees=c(10,20,30,1,2,3));
paramList <- list(mean=c(10,20,30,5,10,15), sd=c(2,4,6,1,2,3));
distSim(nSims=3, simName='harvestPerTree', distFn=rnorm, dimList=dimList,
paramList=paramList, constList=constList);
示例输出:
simId farms crops mean sd numTrees harvestPerTree
1 1 farm A apple 10 2 10 9.602849
2 1 farm B apple 20 4 20 20.153225
3 1 farm A pear 30 6 30 25.839825
4 1 farm B pear 5 1 1 1.733120
5 1 farm A durian 10 2 2 13.506135
6 1 farm B durian 15 3 3 11.690910
7 2 farm A apple 10 2 10 7.678611
8 2 farm B apple 20 4 20 22.119560
9 2 farm A pear 30 6 30 31.488002
10 2 farm B pear 5 1 1 5.366725
11 2 farm A durian 10 2 2 6.333747
12 2 farm B durian 15 3 3 13.918085
13 3 farm A apple 10 2 10 10.387194
14 3 farm B apple 20 4 20 21.086388
15 3 farm A pear 30 6 30 34.076926
16 3 farm B pear 5 1 1 6.159415
17 3 farm A durian 10 2 2 8.322902
18 3 farm B durian 15 3 3 9.458085
另请注意,您还可以采用精确索引的方式定义输入值;例如如果你定义
numTrees2 <- array(0, dim=c(length(farms), length(crops)), dimnames=tree_dimList);
numTrees2['Farm A','apple'] = 200;
# etc
然后,c(numTrees)
命令其输出的方式与expand.grid
匹配,因此您可以将constList = list(numTrees=c(numTrees2)
传递给distSim
。